@channel **Hi Everyone,**

**2023-09-21 `04.2-Data-Analysis-Exploring Pandas`**

Day 1 of Pandas was really great. Everyone seemed to be sailing right along. On Day 2, we will keep the learning coming with topics on selecting data from Pandas DataFrames, filtering, grouping, aggregating and sorting.

**Objectives**
* Navigate through DataFrames using `loc` and `iloc`.
* `Filter` and `slice` Pandas DataFrames.
* Create and access Pandas `GroupBy` objects.
* `Sort` DataFrames.

**Slideshows**
* [04.2-Data-Analysis-Exploring Pandas](https://git.bootcampcontent.com/University-of-California---Berkeley/UCB-VIRT-DATA-PT-08-2023-U-LOLC/-/blob/main/Slides/Data-04.2-Exploring_Pandas.pdf)

**Resources**
* [loc and iloc tutorial](https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/)
* [Deleting from a dataframe](https://pythonexamples.org/pandas-dataframe-delete-column/)
* [GREAT - short article on grouping and sorting](https://medium.com/datamadeeasy/pandas-made-easy-groupby-65e4e3c26a6)

**Install**
* `pip install openpyxl`

**Best wishes.**

# ==========================================

### 2.01 Instructor Do: Exploring Data With Loc and Iloc (10 min)

In [1]:
# Dependencies
import pandas as pd
from pathlib import Path

In [2]:
# Store filepath in a variable
file = Path("01-Ins_LocAndIloc/Solved/Resources/baton_streets.csv")

In [3]:
# Read our data file with the Pandas library
original_df = pd.read_csv(file)
original_df.head()

Unnamed: 0,STREET NAME ID,STREET NAME,STREET FULL NAME,POSTAL COMMUNITY,MUNICIPAL COMMUNITY
0,1400342,PRIVATE STREET,PRIVATE STREET,BATON ROUGE,BATON ROUGE
1,1,4TH,N 4TH ST,BATON ROUGE,BATON ROUGE
2,10,11TH,S 11TH ST,BATON ROUGE,BATON ROUGE
3,100,ADDINGTON,ADDINGTON AVE,BATON ROUGE,BATON ROUGE
4,1000,CHALFONT,W CHALFONT DR,BATON ROUGE,PARISH


In [4]:
# Set new index to STREET NAME
df = original_df.set_index("STREET NAME")
df.head()

Unnamed: 0_level_0,STREET NAME ID,STREET FULL NAME,POSTAL COMMUNITY,MUNICIPAL COMMUNITY
STREET NAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
PRIVATE STREET,1400342,PRIVATE STREET,BATON ROUGE,BATON ROUGE
4TH,1,N 4TH ST,BATON ROUGE,BATON ROUGE
11TH,10,S 11TH ST,BATON ROUGE,BATON ROUGE
ADDINGTON,100,ADDINGTON AVE,BATON ROUGE,BATON ROUGE
CHALFONT,1000,W CHALFONT DR,BATON ROUGE,PARISH


In [5]:
# Get the data contained within the "ADDINGTON" row and the "STREET FULL NAME" column
addington_name = df.loc["ADDINGTON", "STREET FULL NAME"]
print("Using loc: " + addington_name)

also_addington_name = df.iloc[3, 1]
print("Using iloc: " + also_addington_name)

Using loc: ADDINGTON AVE
Using iloc: ADDINGTON AVE
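
As a hedged illustration of the difference between the two indexers, the sketch below uses a small hand-made DataFrame (the `streets` name and its three rows are assumptions for illustration, not the contents of `baton_streets.csv`). `loc` addresses cells by label and `iloc` by integer position. Label slices include the end label, while position slices stop before the end position.

```python
import pandas as pd

# Hypothetical three-row sample indexed by street name
streets = pd.DataFrame(
    {
        "STREET NAME": ["4TH", "ADDINGTON", "CHALFONT"],
        "STREET FULL NAME": ["N 4TH ST", "ADDINGTON AVE", "W CHALFONT DR"],
        "POSTAL COMMUNITY": ["BATON ROUGE", "BATON ROUGE", "BATON ROUGE"],
    }
).set_index("STREET NAME")

# Same cell, addressed by label and by position
by_label = streets.loc["ADDINGTON", "STREET FULL NAME"]
by_position = streets.iloc[1, 0]

# Label slices include the end label; position slices exclude the end position
label_slice = streets.loc["4TH":"ADDINGTON"]
position_slice = streets.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```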


In [6]:
# Get the rows for the first five street names and the columns from "STREET NAME ID" to "POSTAL COMMUNITY"
# The problem with using "STREET NAME" as the index is that the values are not unique, so loc returns every row that shares a label
# With duplicate labels in an unsorted index, a label slice such as df.loc["PRIVATE STREET":"CHALFONT"] can raise a KeyError, so the labels are passed as a list
private_to_chalfont = df.loc[["PRIVATE STREET", "4TH", "11TH", "ADDINGTON", 
                              "CHALFONT"], ["STREET NAME ID", "STREET FULL NAME", "POSTAL COMMUNITY"]]
print(private_to_chalfont)

print()

# iloc selects by integer position, and positions are always unique, so no duplicate rows are returned
also_private_to_chalfont = df.iloc[0:5, 0:3]
print(also_private_to_chalfont)

                STREET NAME ID STREET FULL NAME POSTAL COMMUNITY
STREET NAME                                                     
PRIVATE STREET         1400342   PRIVATE STREET      BATON ROUGE
PRIVATE STREET         1400001   PRIVATE STREET      BATON ROUGE
PRIVATE STREET         1400015   PRIVATE STREET      BATON ROUGE
PRIVATE STREET         1400161   PRIVATE STREET      BATON ROUGE
PRIVATE STREET         1400343   PRIVATE STREET      BATON ROUGE
...                        ...              ...              ...
11TH                         9        N 11TH ST      BATON ROUGE
ADDINGTON                  100    ADDINGTON AVE      BATON ROUGE
CHALFONT                  1000    W CHALFONT DR      BATON ROUGE
CHALFONT                   998    N CHALFONT DR      BATON ROUGE
CHALFONT                   999    S CHALFONT DR      BATON ROUGE

[329 rows x 3 columns]

                STREET NAME ID STREET FULL NAME POSTAL COMMUNITY
STREET NAME                                                     
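
The duplicate-label behavior can be reproduced on a tiny example. This is a sketch under stated assumptions: `toy` and its ID values are hypothetical, and the repeated labels are deliberately placed in non-adjacent rows of an unsorted index, which appears to be the situation in the street data.

```python
import pandas as pd

# Hypothetical sample with repeated labels in non-adjacent rows
toy = pd.DataFrame(
    {"STREET NAME ID": [1, 2, 3, 4, 5]},
    index=pd.Index(
        ["PRIVATE STREET", "4TH", "11TH", "PRIVATE STREET", "11TH"],
        name="STREET NAME",
    ),
)

# A list of labels returns every row that carries one of those labels
by_labels = toy.loc[["PRIVATE STREET", "11TH"]]

# Positions are unique, so this always returns exactly three rows
by_position = toy.iloc[0:3]

# A label slice needs an unambiguous start and stop, which duplicates prevent here
try:
    toy.loc["PRIVATE STREET":"11TH"]
    slice_failed = False
except KeyError:
    slice_failed = True

print(len(by_labels), len(by_position), slice_failed)
```

If duplicate labels are not wanted in the result, keeping the default integer index (or choosing a unique column such as an ID) avoids the ambiguity.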


In [7]:
# Using loc to select all rows for the `STREET FULL NAME` and `POSTAL COMMUNITY` columns.
df.loc[:, ["STREET FULL NAME", "POSTAL COMMUNITY"]].head()

Unnamed: 0_level_0,STREET FULL NAME,POSTAL COMMUNITY
STREET NAME,Unnamed: 1_level_1,Unnamed: 2_level_1
PRIVATE STREET,PRIVATE STREET,BATON ROUGE
4TH,N 4TH ST,BATON ROUGE
11TH,S 11TH ST,BATON ROUGE
ADDINGTON,ADDINGTON AVE,BATON ROUGE
CHALFONT,W CHALFONT DR,BATON ROUGE


In [8]:
# Using iloc to select all rows for the second and third columns.
df.iloc[:, 1:3].head()

Unnamed: 0_level_0,STREET FULL NAME,POSTAL COMMUNITY
STREET NAME,Unnamed: 1_level_1,Unnamed: 2_level_1
PRIVATE STREET,PRIVATE STREET,BATON ROUGE
4TH,N 4TH ST,BATON ROUGE
11TH,S 11TH ST,BATON ROUGE
ADDINGTON,ADDINGTON AVE,BATON ROUGE
CHALFONT,W CHALFONT DR,BATON ROUGE


In [9]:
# The following logic test/conditional statement returns a series of boolean values
municipal_parish = df["MUNICIPAL COMMUNITY"] == "PARISH"
municipal_parish.head()

STREET NAME
PRIVATE STREET    False
4TH               False
11TH              False
ADDINGTON         False
CHALFONT           True
Name: MUNICIPAL COMMUNITY, dtype: bool

In [10]:
# loc accepts a conditional statement (a Boolean Series) to filter rows of data
# Only the rows where the condition is True are returned
only_prairieville = df.loc[df["POSTAL COMMUNITY"] == "PRAIRIEVILLE", :]
print(only_prairieville)

                 STREET NAME ID    STREET FULL NAME POSTAL COMMUNITY  \
STREET NAME                                                            
ALLIGATOR BAYOU           16497  ALLIGATOR BAYOU RD     PRAIRIEVILLE   
BLUFF                     16498            BLUFF RD     PRAIRIEVILLE   

                MUNICIPAL COMMUNITY  
STREET NAME                          
ALLIGATOR BAYOU              PARISH  
BLUFF                        PARISH  


In [11]:
# iloc can select the column to test by position (column 2 is "POSTAL COMMUNITY")
# The resulting Boolean Series is then passed to df[...] to keep only the rows where it is True
also_only_prairieville = df[df.iloc[:,2] == "PRAIRIEVILLE"]
print(also_only_prairieville)

                 STREET NAME ID    STREET FULL NAME POSTAL COMMUNITY  \
STREET NAME                                                            
ALLIGATOR BAYOU           16497  ALLIGATOR BAYOU RD     PRAIRIEVILLE   
BLUFF                     16498            BLUFF RD     PRAIRIEVILLE   

                MUNICIPAL COMMUNITY  
STREET NAME                          
ALLIGATOR BAYOU              PARISH  
BLUFF                        PARISH  
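
To make the filtering mechanics explicit, here is a minimal sketch on hypothetical data (`toy` and its values are assumptions). It shows that the comparison yields a Boolean Series, that `loc` and plain brackets both accept it, and that `iloc` expects a plain Boolean array rather than a Series.

```python
import pandas as pd

# Hypothetical sample; the values are illustrative only
toy = pd.DataFrame(
    {
        "STREET FULL NAME": ["BLUFF RD", "N 4TH ST", "RENEE CT"],
        "POSTAL COMMUNITY": ["PRAIRIEVILLE", "BATON ROUGE", "JACKSON"],
    },
    index=pd.Index(["BLUFF", "4TH", "RENEE"], name="STREET NAME"),
)

# The comparison produces one True/False value per row
mask = toy["POSTAL COMMUNITY"] == "PRAIRIEVILLE"

with_loc = toy.loc[mask]               # the label-based indexer accepts the mask
with_brackets = toy[mask]              # plain brackets accept it as well
with_iloc = toy.iloc[mask.to_numpy()]  # iloc is given a plain Boolean array

print(with_loc)
```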


In [12]:
# Multiple conditions can be combined: & (and) narrows the filter, | (or) widens it
# Wrap each condition in parentheses
only_prairieville_and_jackson = df.loc[(df["POSTAL COMMUNITY"] == "PRAIRIEVILLE") | (
    df["POSTAL COMMUNITY"] == "JACKSON"), :]
print(only_prairieville_and_jackson)

                  STREET NAME ID      STREET FULL NAME POSTAL COMMUNITY  \
STREET NAME                                                               
TALMADGE                    4772           TALMADGE DR          JACKSON   
TREAKLE                     4911            TREAKLE DR          JACKSON   
DENNIS                      1452             DENNIS CT          JACKSON   
ALLIGATOR BAYOU            16497    ALLIGATOR BAYOU RD     PRAIRIEVILLE   
BLUFF                      16498              BLUFF RD     PRAIRIEVILLE   
RENEE                       4072              RENEE CT          JACKSON   
SANDY SPRINGS               4320      SANDY SPRINGS LN          JACKSON   
SHANE                       4405              SHANE CT          JACKSON   
BICKHAM                      518            BICKHAM RD          JACKSON   
ADAMS                       5527              ADAMS LN          JACKSON   
LA 68                       5838             LA 68 HWY          JACKSON   
SIMMONS                  
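
When the "or" conditions all test the same column, `isin` is a common shorthand. The sketch below uses hypothetical data (`toy` is an assumption) to show that both forms select the same rows.

```python
import pandas as pd

# Hypothetical sample; the values are illustrative only
toy = pd.DataFrame(
    {
        "POSTAL COMMUNITY": ["JACKSON", "PRAIRIEVILLE", "BATON ROUGE", "JACKSON"],
        "MUNICIPAL COMMUNITY": ["PARISH", "PARISH", "BATON ROUGE", "PARISH"],
    },
    index=pd.Index(["TALMADGE", "BLUFF", "4TH", "RENEE"], name="STREET NAME"),
)

# Each comparison is wrapped in parentheses because | binds tighter than ==
either = toy.loc[
    (toy["POSTAL COMMUNITY"] == "PRAIRIEVILLE")
    | (toy["POSTAL COMMUNITY"] == "JACKSON")
]

# isin() expresses the same "any of these values" test more compactly
same_with_isin = toy.loc[toy["POSTAL COMMUNITY"].isin(["PRAIRIEVILLE", "JACKSON"])]

print(either.equals(same_with_isin))
```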

# ==========================================

### 2.02 Students Do: Good Movies (15 min)

# Good Movies

In this activity, you will create an application that searches through IMDb data to find only the best movies out there.

## Instructions

* Use Pandas to load and display the CSV provided in `Resources`.

* List all the columns in the dataset.

* We're only interested in IMDb data, so create a new table that takes the film and all the columns related to IMDb.

* Filter out only the good movies&mdash;any film with an IMDb score greater than or equal to 7&mdash;and remove the norm ratings.

* Find less popular movies that you may not have heard about&mdash;anything with under 20K votes.

* Finally, export this file to a spreadsheet, excluding the index, so we can keep track of our future watchlist.

## References

[Movie Rating Dataset](https://github.com/fivethirtyeight/data/blob/master/fandango/fandango_score_comparison.csv)

---

In [13]:
# Dependencies
import pandas as pd
from pathlib import Path

In [117]:
# Load in file
# Store filepath in a variable
movie_file = Path("02-Stu_GoodMovies_Loc/Unsolved/Resources/movie_scores.csv")

In [118]:
# Read and display the CSV with Pandas
movie_file_df = pd.read_csv(movie_file)
movie_file_df.head()

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5
2,Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,...,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5
3,Do You Believe? (2015),18,84,22,4.7,5.4,5.0,4.5,0.9,4.2,...,2.7,1.0,4.0,1.0,2.5,2.5,31,3136,1793,0.5
4,Hot Tub Time Machine 2 (2015),14,28,29,3.4,5.1,3.5,3.0,0.7,1.4,...,2.55,0.5,1.5,1.5,1.5,2.5,88,19560,1021,0.5


In [119]:
# List all the columns in the table
movie_file_df.columns

Index(['FILM', 'RottenTomatoes', 'RottenTomatoes_User', 'Metacritic',
       'Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Metacritic_user_vote_count', 'IMDB_user_vote_count',
       'Fandango_votes', 'Fandango_Difference'],
      dtype='object')

In [120]:
# We only want IMDb data, so create a new table that takes the Film and all the columns relating to IMDB
imdb_df = movie_file_df[["FILM", "IMDB", "IMDB_norm",
                            "IMDB_norm_round", "IMDB_user_vote_count"]]
imdb_df.head()

Unnamed: 0,FILM,IMDB,IMDB_norm,IMDB_norm_round,IMDB_user_vote_count
0,Avengers: Age of Ultron (2015),7.8,3.9,4.0,271107
1,Cinderella (2015),7.1,3.55,3.5,65709
2,Ant-Man (2015),7.8,3.9,4.0,103660
3,Do You Believe? (2015),5.4,2.7,2.5,3136
4,Hot Tub Time Machine 2 (2015),5.1,2.55,2.5,19560


In [121]:
# We only like good movies, so find those that scored 7 or higher, and ignore the norm ratings
good_movies_df = movie_file_df.loc[movie_file_df["IMDB"] >= 7, [
    "FILM", "IMDB", "IMDB_user_vote_count"]]
good_movies_df.head()

Unnamed: 0,FILM,IMDB,IMDB_user_vote_count
0,Avengers: Age of Ultron (2015),7.8,271107
1,Cinderella (2015),7.1,65709
2,Ant-Man (2015),7.8,103660
5,The Water Diviner (2015),7.2,39373
8,Shaun the Sheep Movie (2015),7.4,12227


In [122]:
# Find less popular movies--i.e., those with fewer than 20K votes
unknown_movies_df = good_movies_df.loc[good_movies_df["IMDB_user_vote_count"] < 20000, [
    "FILM", "IMDB", "IMDB_user_vote_count"]]
unknown_movies_df.head()

Unnamed: 0,FILM,IMDB,IMDB_user_vote_count
8,Shaun the Sheep Movie (2015),7.4,12227
9,Love & Mercy (2015),7.8,5367
10,Far From The Madding Crowd (2015),7.2,12129
20,"McFarland, USA (2015)",7.5,13769
29,The End of the Tour (2015),7.9,1320
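
The two filtering steps can also be written as a single selection. This is only a sketch: the `movies` table, the film names, and the numbers are hypothetical, and the threshold follows the activity instructions (a score of 7 or higher, under 20K votes).

```python
import pandas as pd

# Hypothetical sample; film names and scores are made up for illustration
movies = pd.DataFrame(
    {
        "FILM": ["Film A", "Film B", "Film C", "Film D"],
        "IMDB": [7.8, 7.4, 5.4, 7.0],
        "IMDB_user_vote_count": [271107, 12227, 3136, 1320],
    }
)

# Name each condition so the combined filter reads like the instructions
is_good = movies["IMDB"] >= 7
is_lesser_known = movies["IMDB_user_vote_count"] < 20000

# & keeps only the rows where both conditions are True
hidden_gems = movies.loc[is_good & is_lesser_known, ["FILM", "IMDB"]]
print(hidden_gems)
```

Keeping the intermediate DataFrames, as the cells above do, is easier to inspect step by step; the combined form is shorter once the logic is settled.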


In [123]:
# Finally, export this file to a spreadsheet, without the index, so we can keep track of our future watchlist
unknown_movies_df.to_excel("02-Stu_GoodMovies_Loc/Solved/output/movieWatchlist.xlsx", index=False)
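
To see what `index=False` changes, the sketch below writes a hypothetical `watchlist` DataFrame to CSV text instead of an Excel file, so that it runs without `openpyxl`. `to_csv` takes the same `index` parameter as `to_excel`.

```python
import pandas as pd

# Hypothetical filtered result that kept its original row labels (8 and 9)
watchlist = pd.DataFrame(
    {"FILM": ["Film A", "Film B"], "IMDB": [7.4, 7.8]},
    index=[8, 9],
)

# With no path argument, to_csv returns the CSV text as a string
with_index = watchlist.to_csv()
without_index = watchlist.to_csv(index=False)

print(with_index)
print(without_index)
```

With the index included, the file gains an unnamed leading column holding the leftover row labels, which is rarely useful in a watchlist.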

# ==========================================

### 2.03 Instructor Do: Cleaning Data (5 min)

In [124]:
# Dependencies
import pandas as pd
from pathlib import Path

In [125]:
# Name of the CSV file
file = Path('03-Ins_Cleaning_Data/Solved/Resources/donors2021_unclean.csv')

In [126]:
# The correct encoding must be used to read the CSV in pandas
df = pd.read_csv(file, encoding="ISO-8859-1")

In [127]:
# Preview of the DataFrame
# Note that Memo_CD is likely a meaningless column
df.head()

Unnamed: 0,Name,Employer,City,State,Zip,Amount,Memo_CD
0,"CAREY, JAMES",NOT EMPLOYED,HOCKESSIN,DE,197071618.0,500,
1,"OBICI, SILVANA",STONY BROOK,PORT JEFFERSON STATION,NY,117764286.0,250,
2,"MAISLIN, KAREN",RETIRED,WILLIAMSVILLE,NY,14221.0,250,
3,"MCCLELLAND, CARTER AND STEPHANIE",UNION SQUARE ADVISORS,NEW YORK,NY,10023.0,1000,
4,"MCCLUSKEY, MARTHA",STATE UNIVERSITY OF NEW YORK,BUFFALO,NY,14214.0,250,


In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      2000 non-null   object 
 1   Employer  1820 non-null   object 
 2   City      1999 non-null   object 
 3   State     1999 non-null   object 
 4   Zip       1996 non-null   float64
 5   Amount    2000 non-null   int64  
 6   Memo_CD   0 non-null      float64
dtypes: float64(2), int64(1), object(4)
memory usage: 109.5+ KB


In [148]:
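# Overwrite Memo_CD and add a second placeholder column, Memo_CD1,
# so that the next cell can delete two columns at once
df['Memo_CD'] = ""
df['Memo_CD1'] = ""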

In [150]:
# Delete extraneous column
del df['Memo_CD'], df['Memo_CD1']
df.head()

Unnamed: 0,Name,Employer,City,State,Zip,Amount
0,"CAREY, JAMES",NOT EMPLOYED,HOCKESSIN,DE,197071618.0,500
1,"OBICI, SILVANA",STONY BROOK,PORT JEFFERSON STATION,NY,117764286.0,250
2,"MAISLIN, KAREN",RETIRED,WILLIAMSVILLE,NY,14221.0,250
3,"MCCLELLAND, CARTER AND STEPHANIE",UNION SQUARE ADVISORS,NEW YORK,NY,10023.0,1000
4,"MCCLUSKEY, MARTHA",STATE UNIVERSITY OF NEW YORK,BUFFALO,NY,14214.0,250


In [133]:
# Identify incomplete rows
df.count()

Name        2000
Employer    1820
City        1999
State       1999
Zip         1996
Amount      2000
dtype: int64

In [134]:
# Drop all rows with missing information
df = df.dropna(how='any')

In [135]:
# Verify dropped rows
df.count()

Name        1818
Employer    1818
City        1818
State       1818
Zip         1818
Amount      1818
dtype: int64

In [136]:
# The Zip column is the wrong data type. It should be a string (object).
df.dtypes

Name         object
Employer     object
City         object
State        object
Zip         float64
Amount        int64
dtype: object

In [137]:
# Convert the Zip column to a string with Series.astype()
df["Zip"] = df["Zip"].astype(str)

In [139]:
# Alternatively, use the df.astype() method with a column-to-dtype mapping to convert the Zip column
df = df.astype({"Zip": str}, errors='raise')

In [140]:
# Verify that the Zip column datatype has been made an object
df['Zip'].dtype

dtype('O')
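
One caveat worth checking: converting a float column straight to a string keeps the trailing `.0`. The sketch below is a possible refinement, not part of the original notebook. It assumes missing values have already been dropped and uses a hypothetical `zips` Series. It converts to an integer first and pads short values to five digits.

```python
import pandas as pd

# Hypothetical float ZIP codes, as read_csv produces when the column has gaps
zips = pd.Series([14221.0, 6705.0], name="Zip")

# Direct conversion keeps the decimal part of the float
naive = zips.astype(str)

# Going through int64 removes ".0"; zfill restores a dropped leading zero
cleaned = zips.astype("int64").astype(str).str.zfill(5)

print(naive.tolist())
print(cleaned.tolist())
```

Nine-digit ZIP+4 values are not changed by `zfill(5)`, since they are already longer than five characters. Reading the column as text in the first place, with `dtype={"Zip": str}` in `read_csv`, avoids the issue entirely.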

In [141]:
# Display an overview of the Employer column
df['Employer'].value_counts()

NOT EMPLOYED                        609
NONE                                321
SELF-EMPLOYED                       132
SELF                                 33
RETIRED                              32
                                   ... 
NOKIA CORP                            1
FH MINE SUPPLY INC.                   1
DREYER INTERNATIONAL ACADEMY LLC      1
RAY GRAHAM ASSOCIATION                1
5T WEALTH, LLC                        1
Name: Employer, Length: 519, dtype: int64

In [142]:
# Clean up Employer category. Replace 'SELF' and 'SELF EMPLOYED' with 'SELF-EMPLOYED'
df['Employer'] = df['Employer'].replace(
    {
        'SELF': 'SELF-EMPLOYED',
        'SELF EMPLOYED': 'SELF-EMPLOYED'
    }
)

In [143]:
# Verify clean-up.
df['Employer'].value_counts()

NOT EMPLOYED                            609
NONE                                    321
SELF-EMPLOYED                           180
RETIRED                                  32
INGRAM BARGE COMPANY                     30
                                       ... 
GOOGLE LLC                                1
BP INDUSTRIES INC                         1
HOT SPRINGS COUNTY DISTRICT HOSPITAL      1
INVEST AMERICA REALTY                     1
5T WEALTH, LLC                            1
Name: Employer, Length: 517, dtype: int64
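
A hedged extension of the same idea: trimming whitespace and upper-casing before the replacement catches near-duplicates that differ only in formatting. The `employers` Series below is hypothetical, and the normalization step is an assumption about what might be useful rather than something the donor data is known to need.

```python
import pandas as pd

# Hypothetical values with inconsistent case and stray whitespace
employers = pd.Series(
    ["SELF", "self employed ", "SELF-EMPLOYED", "RETIRED"], name="Employer"
)

# Trim whitespace and upper-case first so near-duplicates line up
normalized = employers.str.strip().str.upper()

# replace() with a dict matches whole values, not substrings
cleaned = normalized.replace(
    {"SELF": "SELF-EMPLOYED", "SELF EMPLOYED": "SELF-EMPLOYED"}
)

counts = cleaned.value_counts()
print(counts)
```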

In [35]:
# Clean up Employer category. Replace 'NOT EMPLOYED' with 'UNEMPLOYED'
df['Employer'] = df['Employer'].replace({'NOT EMPLOYED': 'UNEMPLOYED'})
df['Employer'].value_counts()

UNEMPLOYED                        611
NONE                              321
SELF-EMPLOYED                     180
RETIRED                            32
INGRAM BARGE COMPANY               30
                                 ... 
JEROME'S COLLISION CENTER           1
LINDQUIST MORTIARIES                1
GAINESVILLE SKIN CANCER CENTER      1
RYAN SPECIALTYGROUP                 1
5T WEALTH, LLC                      1
Name: Employer, Length: 516, dtype: int64

In [144]:
# Display a statistical overview
# We can infer the maximum allowable individual contribution from 'max'
df.describe()

Unnamed: 0,Amount
count,1818.0
mean,752.127613
std,11601.791128
min,-1000.0
25%,25.0
50%,50.0
75%,200.0
max,400000.0


In [145]:
# Save the DataFrame to a CSV file. 
df.to_csv("03-Ins_Cleaning_Data/Solved/Resources/donors2021.csv", index=False, encoding="ISO-8859-1")

# ==========================================

### 2.04 Partners Do: Hong Kong LPG Appliances - Cleaning Data (15 min)

# Hong Kong LPG Appliances

In this activity, you will take an LPG appliance dataset from Hong Kong, and clean it up so that the DataFrame is consistent and does not have any rows with missing data.

## Instructions

* Read in the CSV using Pandas, and print out the DataFrame that is returned.

  * **Note:** This dataset uses Chinese characters and should be read in using UTF-8 encoding.

* Reduce the DataFrame to only the columns in English.

* Get a count of the rows within the DataFrame to determine if there are any null values.

* Drop the rows that contain null values.

* Search through the "Applicant" column, and replace any similar values with one consistent value.

* Create a couple of DataFrames, each limited to a single Place of Manufacture, and print them to the screen.

## References

Hong Kong Electrical and Mechanical Services Department via [data.gov.hk](https://data.gov.hk) (2022). Approved List of Domestic Gas Appliances (LP Gas). [https://data.gov.hk/en-data/dataset/hk-emsd-emsd1-domestic-gas-appliances-lpg](https://data.gov.hk/en-data/dataset/hk-emsd-emsd1-domestic-gas-appliances-lpg)

---

In [38]:
# Dependencies
import pandas as pd
from pathlib import Path

In [39]:
# Reference the file where the CSV is located
lpg_csv_path = Path("04-Par_Cleaning_Appliance_Data/Resources/dga_lpg.csv")

# Import the data into a Pandas DataFrame
lpg_df = pd.read_csv(lpg_csv_path, encoding="UTF-8")
lpg_df

Unnamed: 0,Part,Type,Type.1,Brand,牌子,Model,Other Information,其他資料,Place of Manufacture,製造地點,Applicant,申請人,Telephone Number,Approval Expiry Date
0,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,嵌入式單頭平面爐,De Dietrich,,DTG1288XC,CERAMIC GLASS TOP PANEL,陶瓷玻璃面版,Italy,意大利,Gilman Group Limited,太平洋行國際有限公司,2418 3272,2023-09-26
1,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,嵌入式單頭平面爐,Electrolux,伊萊克斯,EGC-2901,CERAMIC GLASS TOP PANEL,陶瓷玻璃面版,The People's Republic of China,中華人民共和國,"Dah Chong Hong, Ltd.",大昌貿易行有限公司,2262 1690,2024-06-24
2,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,嵌入式單頭平面爐,Gaggenau,,VG231120F,STAINLESS STEEL TOP PANEL,不銹鋼面版,France,法國,Kitchen Infinity Corp. Ltd.,Kitchen Infinity Corp. Ltd.,2552 2208,2024-04-01
3,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,嵌入式單頭平面爐,Gaggenau,,VG231114F,STAINLESS STEEL TOP PANEL,不銹鋼面版,France,法國,Kitchen Infinity Corp. Ltd.,Kitchen Infinity Corp. Ltd.,2552 2208,2022-07-19
4,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,嵌入式單頭平面爐,GERMAN POOL,德國寶,GP12-1-LG,GLASS TOP PANEL,玻璃面版,The People's Republic of China,中華人民共和國,German Pool (Hong Kong) Ltd.,德國寶（香港）有限公司,2773 2812,2025-02-15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
934,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (fanned draught) gas water heater,密封式（機動排煙）氣體熱水爐,Vatti,華帝,G12L-HK,TOP FLUE,頂出煙道,The People's Republic of China,中華人民共和國,Vatti (H.K.) Limited,華帝（香港）有限公司,2620 1389,2010-09-13
935,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (fanned draught) gas water heater,密封式（機動排煙）氣體熱水爐,Vatti,華帝,G16L-HK,TOP FLUE,頂出煙道,The People's Republic of China,中華人民共和國,Vatti (H.K.) Limited,華帝（香港）有限公司,2620 1389,2010-09-13
936,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,密封式（自然排煙）氣體熱水爐,Junkers,真佳,J10BF1,NOT APPLICABLE,不適用,Portugal,葡萄牙,"Jebsen & Co., Ltd.",捷成洋行有限公司,2923 8440,2005-06-06
937,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,密封式（自然排煙）氣體熱水爐,Junkers,真佳,WR225-1AV1B31S2704,NOT APPLICABLE,不適用,Portugal,葡萄牙,"Jebsen & Co., Ltd.",捷成洋行有限公司,2923 8440,2005-06-11


In [40]:
# Reduce to columns that are in English
lpg_reduced_columns = lpg_df[["Part", "Type", "Brand", "Model", "Other Information", "Place of Manufacture", 
                              "Applicant", "Telephone Number", "Approval Expiry Date"]]
lpg_reduced_columns

Unnamed: 0,Part,Type,Brand,Model,Other Information,Place of Manufacture,Applicant,Telephone Number,Approval Expiry Date
0,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,De Dietrich,DTG1288XC,CERAMIC GLASS TOP PANEL,Italy,Gilman Group Limited,2418 3272,2023-09-26
1,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,Electrolux,EGC-2901,CERAMIC GLASS TOP PANEL,The People's Republic of China,"Dah Chong Hong, Ltd.",2262 1690,2024-06-24
2,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,Gaggenau,VG231120F,STAINLESS STEEL TOP PANEL,France,Kitchen Infinity Corp. Ltd.,2552 2208,2024-04-01
3,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,Gaggenau,VG231114F,STAINLESS STEEL TOP PANEL,France,Kitchen Infinity Corp. Ltd.,2552 2208,2022-07-19
4,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,GERMAN POOL,GP12-1-LG,GLASS TOP PANEL,The People's Republic of China,German Pool (Hong Kong) Ltd.,2773 2812,2025-02-15
...,...,...,...,...,...,...,...,...,...
934,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (fanned draught) gas water heater,Vatti,G12L-HK,TOP FLUE,The People's Republic of China,Vatti (H.K.) Limited,2620 1389,2010-09-13
935,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (fanned draught) gas water heater,Vatti,G16L-HK,TOP FLUE,The People's Republic of China,Vatti (H.K.) Limited,2620 1389,2010-09-13
936,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,Junkers,J10BF1,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-06
937,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,Junkers,WR225-1AV1B31S2704,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-11


In [41]:
# Look for missing values
lpg_reduced_columns.count()

Part                    939
Type                    939
Brand                   939
Model                   939
Other Information       936
Place of Manufacture    939
Applicant               939
Telephone Number        939
Approval Expiry Date    939
dtype: int64

In [42]:
# Drop null rows
no_null_lpg_df = lpg_reduced_columns.dropna(how='any')

In [43]:
# Verify counts
no_null_lpg_df.count()

Part                    936
Type                    936
Brand                   936
Model                   936
Other Information       936
Place of Manufacture    936
Applicant               936
Telephone Number        936
Approval Expiry Date    936
dtype: int64

In [44]:
# List unique values of "Applicant" to locate any that may be the same
no_null_lpg_df["Applicant"].unique()

array(['Gilman Group Limited', 'Dah Chong Hong, Ltd.',
       'Kitchen Infinity Corp. Ltd.', 'German Pool (Hong Kong) Ltd.',
       'World Engineering Limited', 'Hibachi Gas Cooker Limited',
       'Whirlpool (Hong Kong) Ltd.', 'Miele (Hong Kong) Limited',
       'Dong Woo Industrial Co. Ltd.', 'BSH Home Appliances Ltd.',
       'The Union Gas Appliances (Holdings) Ltd.',
       'Whampo Trading Limited', 'Sunny Eternal Limited',
       'Toptech Co. Limited', 'D & A Electronics Co., Ltd.',
       'Fidelity (Far East) Trading Co., Ltd.',
       'Araytron Technology Limited', 'Charm Vantage Limited',
       'Lighting Gas Stoves Trading Ltd', 'Lighting (Japan) Trading Ltd.',
       'Homepro International Limited',
       'The Hong Kong & China Gas Co., Ltd.',
       'Energy Trading Company Limited',
       'Wealthy Link International Trading Limited',
       'Wetol Company Limited',
       'Crown Gas Stoves (Holdings) Company Limited',
       'Iwatani Corporation (Hong Kong) Limited',
    

In [45]:
# Combine similar applicants together
no_null_lpg_df = no_null_lpg_df.replace(
    {"Crown Gas Stoves (Holdings) Company Limited": "Crown Gas Stoves Co., Ltd.", 
     "Sun Kee LP Gas Co.": "Sun Kee LP Gas Co. Limited"})
no_null_lpg_df

Unnamed: 0,Part,Type,Brand,Model,Other Information,Place of Manufacture,Applicant,Telephone Number,Approval Expiry Date
0,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,De Dietrich,DTG1288XC,CERAMIC GLASS TOP PANEL,Italy,Gilman Group Limited,2418 3272,2023-09-26
1,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,Electrolux,EGC-2901,CERAMIC GLASS TOP PANEL,The People's Republic of China,"Dah Chong Hong, Ltd.",2262 1690,2024-06-24
2,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,Gaggenau,VG231120F,STAINLESS STEEL TOP PANEL,France,Kitchen Infinity Corp. Ltd.,2552 2208,2024-04-01
3,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,Gaggenau,VG231114F,STAINLESS STEEL TOP PANEL,France,Kitchen Infinity Corp. Ltd.,2552 2208,2022-07-19
4,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 1 Burner,GERMAN POOL,GP12-1-LG,GLASS TOP PANEL,The People's Republic of China,German Pool (Hong Kong) Ltd.,2773 2812,2025-02-15
...,...,...,...,...,...,...,...,...,...
934,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (fanned draught) gas water heater,Vatti,G12L-HK,TOP FLUE,The People's Republic of China,Vatti (H.K.) Limited,2620 1389,2010-09-13
935,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (fanned draught) gas water heater,Vatti,G16L-HK,TOP FLUE,The People's Republic of China,Vatti (H.K.) Limited,2620 1389,2010-09-13
936,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,Junkers,J10BF1,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-06
937,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,Junkers,WR225-1AV1B31S2704,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-11


In [46]:
# Check to see if you combined similar applicants correctly in "Applicant"
no_null_lpg_df["Applicant"].unique()

array(['Gilman Group Limited', 'Dah Chong Hong, Ltd.',
       'Kitchen Infinity Corp. Ltd.', 'German Pool (Hong Kong) Ltd.',
       'World Engineering Limited', 'Hibachi Gas Cooker Limited',
       'Whirlpool (Hong Kong) Ltd.', 'Miele (Hong Kong) Limited',
       'Dong Woo Industrial Co. Ltd.', 'BSH Home Appliances Ltd.',
       'The Union Gas Appliances (Holdings) Ltd.',
       'Whampo Trading Limited', 'Sunny Eternal Limited',
       'Toptech Co. Limited', 'D & A Electronics Co., Ltd.',
       'Fidelity (Far East) Trading Co., Ltd.',
       'Araytron Technology Limited', 'Charm Vantage Limited',
       'Lighting Gas Stoves Trading Ltd', 'Lighting (Japan) Trading Ltd.',
       'Homepro International Limited',
       'The Hong Kong & China Gas Co., Ltd.',
       'Energy Trading Company Limited',
       'Wealthy Link International Trading Limited',
       'Wetol Company Limited', 'Crown Gas Stoves Co., Ltd.',
       'Iwatani Corporation (Hong Kong) Limited',
       'Sun Kee LP Gas Co. L

In [47]:
# Create a new DataFrame that looks into a specific Place of Manufacture
malaysia_products_df = no_null_lpg_df.loc[no_null_lpg_df["Place of Manufacture"] == "Malaysia"]
malaysia_products_df

Unnamed: 0,Part,Type,Brand,Model,Other Information,Place of Manufacture,Applicant,Telephone Number,Approval Expiry Date
54,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 2 Burners,Rasonic,RG-233GB,GLASS TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2022-11-27
55,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 2 Burners,Rasonic,RG-233GS,GLASS TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2023-06-13
56,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 2 Burners,Rasonic,RG-233GW,GLASS TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2023-06-13
79,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 3 Burners,Rasonic,RG-323AGB,GLASS TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2024-06-27
80,Part 1 – Approved Domestic Gas Cooking Appliances,Built-in Hotplate 3 Burners,Rasonic,RG-323AGW,GLASS TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2024-06-27
109,Part 1 – Approved Domestic Gas Cooking Appliances,Free Standing Hotplate 1 Burner,Rasonic,RG-10SA,STAINLESS STEEL TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2024-05-02
135,Part 1 – Approved Domestic Gas Cooking Appliances,Free Standing Hotplate 2 Burners,Rasonic,RG-30S,STAINLESS STEEL TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2022-12-17
136,Part 1 – Approved Domestic Gas Cooking Appliances,Free Standing Hotplate 2 Burners,Rasonic,RG-32S,STAINLESS STEEL TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2022-12-17
282,Part 5 – Approved Domestic Gas Appliances whic...,Built-in Hotplate 2 Burners,Garwoods,GS-208,GLASS TOP PANEL,Malaysia,Famous International (H.K.) Ltd.,2139 2239,2007-12-19
365,Part 5 – Approved Domestic Gas Appliances whic...,Built-in Hotplate 2 Burners,Rasonic,RG-213GW,GLASS TOP PANEL,Malaysia,Energy Trading Company Limited,2262 4138,2013-05-26


In [48]:
# Create a new DataFrame that looks into a specific Place of Manufacture
portugal_products_df = no_null_lpg_df.loc[no_null_lpg_df["Place of Manufacture"] == "Portugal"]
portugal_products_df

Unnamed: 0,Part,Type,Brand,Model,Other Information,Place of Manufacture,Applicant,Telephone Number,Approval Expiry Date
863,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (fanned draught) gas water heater,Junkers,WR250-4AM1E31S2705,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-06
936,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,Junkers,J10BF1,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-06
937,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,Junkers,WR225-1AV1B31S2704,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-11
938,Part 5 – Approved Domestic Gas Appliances whic...,Room-sealed (natural draught) gas water heater,Junkers,WR250-1AV1B31S2704,NOT APPLICABLE,Portugal,"Jebsen & Co., Ltd.",2923 8440,2005-06-11
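
Because both cells apply the same filter with a different value, the pattern can be wrapped in a small helper. The function name `products_from` and the `appliances` sample are hypothetical; this is one possible arrangement, not the activity's required solution.

```python
import pandas as pd

# Hypothetical sample; brands and places are illustrative only
appliances = pd.DataFrame(
    {
        "Brand": ["Brand A", "Brand B", "Brand C", "Brand A"],
        "Place of Manufacture": ["Malaysia", "Portugal", "Italy", "Malaysia"],
    }
)


def products_from(frame: pd.DataFrame, place: str) -> pd.DataFrame:
    """Return the rows of `frame` manufactured in `place`."""
    return frame.loc[frame["Place of Manufacture"] == place]


# value_counts() shows which places exist and how many rows each has
place_counts = appliances["Place of Manufacture"].value_counts()

malaysia = products_from(appliances, "Malaysia")
portugal = products_from(appliances, "Portugal")

print(place_counts)
print(malaysia)
```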


# ==========================================

### 2.05 Everyone Do: Pandas Recap and Data Types (15 min)

# Pandas Recap and Data Types

In this activity, we will recap what we have learned about Pandas up to this point.

## Instructions

* Open ‘PandasRecap.ipynb’ under the ‘unsolved’ folder in your Jupyter notebook.

* Go through the cells, and follow the instructions in the comments.

## Hints

* A list of a DataFrame's data types can be checked by accessing its `dtypes` property.

* To change a non-numeric column to a numeric column, use the `df.astype(<datatype>)` method and pass in the desired data type as the parameter.
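
The two hints can be tried on a toy DataFrame before opening the notebook. This is only a sketch: the `df` below and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: the "alarms" numbers were stored as text
df = pd.DataFrame({"city": ["Salem", "Goshen"], "alarms": ["1", "0"]})

# dtypes lists the data type of every column; "alarms" is not numeric yet
print(df.dtypes)

# astype returns a new object, so assign the result back to the column
df["alarms"] = df["alarms"].astype(int)
print(df.dtypes)
```

Note that `astype` does not change the DataFrame in place, which is why the result is assigned back.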

## References

National Fire Incident Reporting System (NFIRS). Connecticut Fire Department Incidents (2012-2016). [https://data.ct.gov/Public-Safety/Connecticut-Fire-Department-Incidents-2012-2016-/qem9-rt8k](https://data.ct.gov/Public-Safety/Connecticut-Fire-Department-Incidents-2012-2016-/qem9-rt8k), reduced in Pandas.

---

In [49]:
# Dependencies
import pandas as pd
from pathlib import Path

In [50]:
# Create a reference to the CSV file
csv_path = Path("05-Evr_Pandas_Recap/Solved/Resources/CT_fires_2015.csv")

# Read the CSV into a Pandas DataFrame
fires_df = pd.read_csv(csv_path, low_memory=False)

# Print the first five rows of data to the screen
fires_df.head()

Unnamed: 0,Reporting Year,Fire Department Header Key,Fire Department Name,Incident date,Incident Number,Exposure Number,Fire Department Station,Incident Type Code,Incident Type,Aid Given or Received Code,...,Mized Use Property Codes,Mixed Use Property,Property Use Code,Property Use,Incident Street Address,Incident Apartment Number,Incident City,Incident Zip Code,Census Tract,Location
0,2015,CT06240,WATERBURY FIRE DEPARTMENT,05/11/2015,6499.0,0,7.0,440,"Electrical wiring/equipment problem, other.",N,...,,,962.0,"Residential street, road, or residential drive...",GREENLEAF AVE,,WATERBURY,6705,,GREENLEAF AVE06705
1,2015,CT08180,THE TOWN OF SALEM,12/01/2015,24036.0,0,21.0,111,Building fire. Excludes confined fires (113–118).,3,...,,,,,Route 82,,Salem,6420,,Route 8206420
2,2015,CT08190,STONINGTON FIRE DEPT,09/07/2015,76.0,0,1.0,400,"Hazardous condition (no fire), other.",N,...,,,0.0,"Property use, other.",Bayview AVE,,Stonington Borough,6378,,Bayview AVE06378
3,2015,CT01012,RIVERTON VOLUNTEER FIRE DEPT,01/02/2015,1.0,0,1.0,444,Power line down. Excludes people trapped by do...,3,...,,,,,Wallens Hill1 RD,,Riverton,6065,,Wallens Hill1 RD06065
4,2015,CT01070,GOSHEN VOLUNTEER FIRE COMPANY,11/03/2015,37.0,0,,150,"Outside rubbish fire, other.",N,...,,,936.0,"Vacant lot. Undeveloped land, not paved, may i...",E east street south,,Goshen,6756,,E east street south06756


In [51]:
# Check the names of all the columns and count the non-null values in each
# (a count lower than the others means that column has missing data)
fires_df.count()

Reporting Year                       34125
Fire Department Header Key           34125
Fire Department Name                 34125
Incident date                        34125
Incident Number                      34125
Exposure Number                      34125
Fire Department Station              28748
Incident Type Code                   34125
Incident Type                        34125
Aid Given or Received Code           34125
Aid Given or Received                34125
Number of Alarms                     32780
Alarm Date and Time                  34125
Arrival Date and Time                34125
Incident Controlled Date and Time     8505
Last Unit Cleared Date and Time      34125
Fire Department Shift                22099
Actions Taken 1                      33905
Actions Taken 2                      10028
Actions Taken 3                       3077
Number of Suppression Apparatus      34125
Number of Suppression Personnel      34125
Number of EMS Apparatus              34125
Number of E

In [52]:
# Rename mistyped columns "Aid Given or Received Code " and "Propery Loss"
fires_df = fires_df.rename(columns={"Aid Given or Received Code ": "Aid Given or Received Code", 
                                    "Propery Loss": "Property Loss"})

In [53]:
# Reduce to columns: Reporting Year, Fire Department Name, Incident date, Incident Type,
# Aid Given or Received Code, Aid Given or Received, Number of Alarms, Alarm Date and Time,
# Arrival Date and Time, Last Unit Cleared Date and Time, Actions Taken 1, Actions Taken 2,
# Actions Taken 3, Property Value, Property Loss, Contents Value, Contents Loss,
# Fire Service Deaths, Fire Service Injuries, Other Fire Deaths, Other Fire Injuries,
# Property Use, Incident Street Address, Incident Apartment Number, Incident City, Incident Zip Code

fires_reduced = fires_df[["Reporting Year", "Fire Department Name", "Incident date", 
                          "Incident Type", "Aid Given or Received Code", "Aid Given or Received", 
                          "Number of Alarms", "Alarm Date and Time", "Arrival Date and Time", 
                          "Last Unit Cleared Date and Time", "Actions Taken 1", 
                          "Actions Taken 2", "Actions Taken 3", "Property Value", 
                          "Property Loss", "Contents Value", "Contents Loss", "Fire Service Deaths", 
                          "Fire Service Injuries", "Other Fire Deaths", "Other Fire Injuries", 
                          "Property Use", "Incident Street Address", "Incident Apartment Number", 
                          "Incident City", "Incident Zip Code"]]
fires_reduced.head()

Unnamed: 0,Reporting Year,Fire Department Name,Incident date,Incident Type,Aid Given or Received Code,Aid Given or Received,Number of Alarms,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,...,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Property Use,Incident Street Address,Incident Apartment Number,Incident City,Incident Zip Code
0,2015,WATERBURY FIRE DEPARTMENT,05/11/2015,"Electrical wiring/equipment problem, other.",N,No aid given or received.,1.0,05/11/2015 03:04:00 PM,05/11/2015 03:12:00 PM,05/11/2015 03:18:00 PM,...,,0.0,0.0,,,"Residential street, road, or residential drive...",GREENLEAF AVE,,WATERBURY,6705
1,2015,THE TOWN OF SALEM,12/01/2015,Building fire. Excludes confined fires (113–118).,3,Mutual aid given to an outside fire service en...,0.0,12/01/2015 09:14:00 AM,12/01/2015 09:19:00 AM,12/01/2015 09:42:00 AM,...,,0.0,0.0,,,,Route 82,,Salem,6420
2,2015,STONINGTON FIRE DEPT,09/07/2015,"Hazardous condition (no fire), other.",N,No aid given or received.,0.0,09/07/2015 05:50:00 PM,09/07/2015 05:54:00 PM,09/07/2015 06:30:00 PM,...,,0.0,0.0,,,"Property use, other.",Bayview AVE,,Stonington Borough,6378
3,2015,RIVERTON VOLUNTEER FIRE DEPT,01/02/2015,Power line down. Excludes people trapped by do...,3,Mutual aid given to an outside fire service en...,0.0,01/02/2015 12:06:00 PM,01/02/2015 12:13:00 PM,01/02/2015 01:58:00 PM,...,,0.0,0.0,,,,Wallens Hill1 RD,,Riverton,6065
4,2015,GOSHEN VOLUNTEER FIRE COMPANY,11/03/2015,"Outside rubbish fire, other.",N,No aid given or received.,,11/03/2015 03:06:00 AM,11/03/2015 03:10:00 AM,11/03/2015 03:45:00 AM,...,0.0,0.0,0.0,,,"Vacant lot. Undeveloped land, not paved, may i...",E east street south,,Goshen,6756


In [54]:
# Fill NAs for columns "Actions Taken 1", "Actions Taken 2", "Actions Taken 3", 
# and "Incident Apartment Number" with ''
# Fill NAs for columns "Other Fire Deaths", "Other Fire Injuries",
# "Property Loss", and "Contents Loss" with 0
fires_reduced = fires_reduced.fillna({"Actions Taken 1": '', 
                                      "Actions Taken 2": '', 
                                      "Actions Taken 3": '',
                                      "Incident Apartment Number": '',
                                      "Other Fire Deaths": 0,
                                      "Other Fire Injuries": 0,
                                      "Property Loss": 0,
                                      "Contents Loss": 0})
fires_reduced.head()

Unnamed: 0,Reporting Year,Fire Department Name,Incident date,Incident Type,Aid Given or Received Code,Aid Given or Received,Number of Alarms,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,...,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Property Use,Incident Street Address,Incident Apartment Number,Incident City,Incident Zip Code
0,2015,WATERBURY FIRE DEPARTMENT,05/11/2015,"Electrical wiring/equipment problem, other.",N,No aid given or received.,1.0,05/11/2015 03:04:00 PM,05/11/2015 03:12:00 PM,05/11/2015 03:18:00 PM,...,0.0,0.0,0.0,0.0,0.0,"Residential street, road, or residential drive...",GREENLEAF AVE,,WATERBURY,6705
1,2015,THE TOWN OF SALEM,12/01/2015,Building fire. Excludes confined fires (113–118).,3,Mutual aid given to an outside fire service en...,0.0,12/01/2015 09:14:00 AM,12/01/2015 09:19:00 AM,12/01/2015 09:42:00 AM,...,0.0,0.0,0.0,0.0,0.0,,Route 82,,Salem,6420
2,2015,STONINGTON FIRE DEPT,09/07/2015,"Hazardous condition (no fire), other.",N,No aid given or received.,0.0,09/07/2015 05:50:00 PM,09/07/2015 05:54:00 PM,09/07/2015 06:30:00 PM,...,0.0,0.0,0.0,0.0,0.0,"Property use, other.",Bayview AVE,,Stonington Borough,6378
3,2015,RIVERTON VOLUNTEER FIRE DEPT,01/02/2015,Power line down. Excludes people trapped by do...,3,Mutual aid given to an outside fire service en...,0.0,01/02/2015 12:06:00 PM,01/02/2015 12:13:00 PM,01/02/2015 01:58:00 PM,...,0.0,0.0,0.0,0.0,0.0,,Wallens Hill1 RD,,Riverton,6065
4,2015,GOSHEN VOLUNTEER FIRE COMPANY,11/03/2015,"Outside rubbish fire, other.",N,No aid given or received.,,11/03/2015 03:06:00 AM,11/03/2015 03:10:00 AM,11/03/2015 03:45:00 AM,...,0.0,0.0,0.0,0.0,0.0,"Vacant lot. Undeveloped land, not paved, may i...",E east street south,,Goshen,6756
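
Passing a dictionary to `fillna`, as in the cell above, applies a different fill value to each named column and leaves every other column alone. A small sketch with invented rows (`sample_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in three columns
sample_df = pd.DataFrame({
    "Actions Taken 2": ["Investigate", np.nan],
    "Contents Loss": [np.nan, 500.0],
    "Property Use": [np.nan, "Street, other."],
})

# Only the columns named in the dictionary are filled
filled_df = sample_df.fillna({"Actions Taken 2": "", "Contents Loss": 0})
print(filled_df)

# "Property Use" was not in the dictionary, so its missing value remains
print(filled_df["Property Use"].isna().sum())
```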


In [55]:
# Remove remaining rows with missing data
fires_cleaned_df = fires_reduced.dropna(how="any")
fires_cleaned_df.count()

Reporting Year                     7145
Fire Department Name               7145
Incident date                      7145
Incident Type                      7145
Aid Given or Received Code         7145
Aid Given or Received              7145
Number of Alarms                   7145
Alarm Date and Time                7145
Arrival Date and Time              7145
Last Unit Cleared Date and Time    7145
Actions Taken 1                    7145
Actions Taken 2                    7145
Actions Taken 3                    7145
Property Value                     7145
Property Loss                      7145
Contents Value                     7145
Contents Loss                      7145
Fire Service Deaths                7145
Fire Service Injuries              7145
Other Fire Deaths                  7145
Other Fire Injuries                7145
Property Use                       7145
Incident Street Address            7145
Incident Apartment Number          7145
Incident City                      7145
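
`how="any"` is the strictest option: a row is dropped if even one column is missing, which is why the counts above fall from 34125 to 7145. The sketch below compares it with two gentler alternatives on made-up data (`sample_df` is hypothetical; whether a gentler option suits the fire data is a judgment call):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one partly missing, one entirely missing
sample_df = pd.DataFrame({
    "Incident City": ["Salem", "Goshen", np.nan],
    "Property Value": [np.nan, 1000.0, np.nan],
})

# Drop a row if at least one value is missing
any_dropped = sample_df.dropna(how="any")

# Drop a row only if every value is missing
all_dropped = sample_df.dropna(how="all")

# Drop a row only if the listed column is missing
subset_dropped = sample_df.dropna(subset=["Incident City"])

print(len(any_dropped), len(all_dropped), len(subset_dropped))
```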


In [56]:
# Filter data to incidents that caused Property or Contents Loss
loss_df = fires_cleaned_df.loc[(fires_cleaned_df["Property Loss"] > 0) |
                               (fires_cleaned_df["Contents Loss"] > 0) , :]
loss_df.head()

Unnamed: 0,Reporting Year,Fire Department Name,Incident date,Incident Type,Aid Given or Received Code,Aid Given or Received,Number of Alarms,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,...,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Property Use,Incident Street Address,Incident Apartment Number,Incident City,Incident Zip Code
24,2015,CITY OF DERBY FIRE DEPT,09/04/2015,"Fire, other.",N,No aid given or received.,1.0,09/04/2015 03:49:00 PM,09/04/2015 03:50:00 PM,09/04/2015 03:59:00 PM,...,0.0,0.0,0.0,0.0,0.0,"Street, other.",Roosevelt DR,,Derby,6418
66,2015,TOWN OF MANCHESTER,02/06/2015,"Fire, other.",N,No aid given or received.,1.0,02/06/2015 03:23:00 AM,02/06/2015 03:29:00 AM,02/06/2015 03:57:00 AM,...,500.0,0.0,0.0,0.0,0.0,Boarding/Rooming house. Includes residential h...,109 Foster ST,,Manchester,6040
69,2015,TOWN OF MANCHESTER,02/02/2015,"Mobile property (vehicle) fire, other.",N,No aid given or received.,1.0,02/02/2015 06:11:00 PM,02/02/2015 06:15:00 PM,02/02/2015 06:25:00 PM,...,0.0,0.0,0.0,0.0,0.0,"Residential street, road, or residential drive...",300 Charter Oak ST,,Manchester,6040
94,2015,NEW LONDON FIRE DEPARTMENT,05/09/2015,"Special outside fire, other.",N,No aid given or received.,0.0,05/09/2015 01:31:00 PM,05/09/2015 01:35:00 PM,05/09/2015 01:40:00 PM,...,1.0,0.0,0.0,0.0,0.0,Street or road in commercial area.,BANK ST,,NEW LONDON,6320
153,2015,BUNGAY FIRE BRIGADE,05/23/2015,Cooking fire involving the contents of a cooki...,N,No aid given or received.,0.0,05/23/2015 07:07:00 AM,05/23/2015 07:26:00 AM,05/23/2015 09:09:00 AM,...,4000.0,0.0,0.0,0.0,0.0,"1- or 2-family dwelling, detached, manufacture...",Jeans CT,,Woodstock,6281
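
The filter above combines two conditions with `|` (or). Each condition needs its own parentheses because `|` and `&` bind more tightly than `>` in Python. A minimal sketch on invented values (`sample_df` is hypothetical) contrasting `|` with `&`:

```python
import pandas as pd

# Hypothetical loss amounts for four incidents
sample_df = pd.DataFrame({
    "Property Loss": [0.0, 1000.0, 0.0, 5000.0],
    "Contents Loss": [0.0, 500.0, 1.0, 0.0],
})

# | keeps rows where either condition is True
either_df = sample_df.loc[(sample_df["Property Loss"] > 0) |
                          (sample_df["Contents Loss"] > 0)]

# & keeps rows where both conditions are True
both_df = sample_df.loc[(sample_df["Property Loss"] > 0) &
                        (sample_df["Contents Loss"] > 0)]

print(len(either_df), len(both_df))
```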


In [57]:
# Count how many incidents occurred in each city
city_counts = loss_df["Incident City"].value_counts()
city_counts

Hartford       264
Bridgeport     216
WATERBURY      187
NEW LONDON     120
Manchester      84
              ... 
PLYMOUTH         1
Stamford         1
New Preston      1
DERBY            1
Madison          1
Name: Incident City, Length: 104, dtype: int64
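
Notice that the output lists `DERBY` separately even though `Derby` appears in the data too: `value_counts` treats differently cased spellings as different cities. One possible cleanup, shown here as a sketch on invented values rather than as part of the activity, is to normalize the text before counting:

```python
import pandas as pd

# Hypothetical city names with inconsistent case and a stray space
cities = pd.Series(["Derby", "DERBY", "Hartford", "hartford ", "Hartford"])

# Counting the raw values splits each city across several spellings
raw_counts = cities.value_counts()

# Strip whitespace and convert to title case so the spellings merge
clean_counts = cities.str.strip().str.title().value_counts()

print(raw_counts)
print(clean_counts)
```

Title case is only one choice; `.str.upper()` would merge the spellings just as well.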

In [58]:
# Convert the city_counts Series into a DataFrame
city_loss_counts_df = pd.DataFrame(city_counts)
city_loss_counts_df.head()

Unnamed: 0,Incident City
Hartford,264
Bridgeport,216
WATERBURY,187
NEW LONDON,120
Manchester,84


In [59]:
# Rename the column to "Sum of Loss Incidents"
city_loss_counts_df = city_loss_counts_df.rename(
    columns={"Incident City": "Sum of Loss Incidents"})
city_loss_counts_df.head()

Unnamed: 0,Sum of Loss Incidents
Hartford,264
Bridgeport,216
WATERBURY,187
NEW LONDON,120
Manchester,84
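
The column name produced by `value_counts` depends on the pandas version: the output above shows `Incident City`, while pandas 2.0 and later name the counts `count`, in which case renaming `"Incident City"` would leave the column unchanged. A sketch that should behave the same on either version is to name the Series first and then convert it (the `cities` data is invented):

```python
import pandas as pd

# Hypothetical city column
cities = pd.Series(["Hartford", "Hartford", "Salem"], name="Incident City")
counts = cities.value_counts()

# Rename the Series itself, then turn it into a one-column DataFrame
counts_df = counts.rename("Sum of Loss Incidents").to_frame()
print(counts_df)
```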


In [60]:
# Calculate the number of deaths from fire incidents where loss occurred
deaths = loss_df["Fire Service Deaths"].sum() + loss_df["Other Fire Deaths"].sum()
deaths

12.0

In [61]:
# To calculate the fire department response time, we need to subtract two timestamps
# The problem: the date columns are stored as text (object), as the data types show
loss_df.dtypes

Reporting Year                       int64
Fire Department Name                object
Incident date                       object
Incident Type                       object
Aid Given or Received Code          object
Aid Given or Received               object
Number of Alarms                    object
Alarm Date and Time                 object
Arrival Date and Time               object
Last Unit Cleared Date and Time     object
Actions Taken 1                     object
Actions Taken 2                     object
Actions Taken 3                     object
Property Value                     float64
Property Loss                      float64
Contents Value                     float64
Contents Loss                      float64
Fire Service Deaths                float64
Fire Service Injuries              float64
Other Fire Deaths                  float64
Other Fire Injuries                float64
Property Use                        object
Incident Street Address             object
Incident Ap

In [62]:
# Convert relevant date columns to datetime
# (the unit is spelled out because newer pandas versions reject the unit-less "datetime64")
loss_df = loss_df.astype({"Incident date": "datetime64[ns]",
    "Alarm Date and Time": "datetime64[ns]",
    "Arrival Date and Time": "datetime64[ns]",
    "Last Unit Cleared Date and Time": "datetime64[ns]"})

loss_df.dtypes

Reporting Year                              int64
Fire Department Name                       object
Incident date                      datetime64[ns]
Incident Type                              object
Aid Given or Received Code                 object
Aid Given or Received                      object
Number of Alarms                           object
Alarm Date and Time                datetime64[ns]
Arrival Date and Time              datetime64[ns]
Last Unit Cleared Date and Time    datetime64[ns]
Actions Taken 1                            object
Actions Taken 2                            object
Actions Taken 3                            object
Property Value                            float64
Property Loss                             float64
Contents Value                            float64
Contents Loss                             float64
Fire Service Deaths                       float64
Fire Service Injuries                     float64
Other Fire Deaths                         float64


In [63]:
# Now it is possible to find the response time in seconds
# Hint: create a new column for "Response Time (seconds)" and use .dt.total_seconds()
# to calculate the seconds
loss_df["Response Time (seconds)"] = loss_df["Arrival Date and Time"] - loss_df["Alarm Date and Time"]
loss_df["Response Time (seconds)"] = loss_df["Response Time (seconds)"].dt.total_seconds()
loss_df["Response Time (seconds)"] = loss_df["Response Time (seconds)"].astype("int")

In [64]:
# Check data for columns of your choosing
response_times = loss_df[["Fire Department Name", "Incident date", "Incident Type", "Arrival Date and Time",
        "Alarm Date and Time", "Response Time (seconds)"]]
response_times.head()

Unnamed: 0,Fire Department Name,Incident date,Incident Type,Arrival Date and Time,Alarm Date and Time,Response Time (seconds)
24,CITY OF DERBY FIRE DEPT,2015-09-04,"Fire, other.",2015-09-04 15:50:00,2015-09-04 15:49:00,60
66,TOWN OF MANCHESTER,2015-02-06,"Fire, other.",2015-02-06 03:29:00,2015-02-06 03:23:00,360
69,TOWN OF MANCHESTER,2015-02-02,"Mobile property (vehicle) fire, other.",2015-02-02 18:15:00,2015-02-02 18:11:00,240
94,NEW LONDON FIRE DEPARTMENT,2015-05-09,"Special outside fire, other.",2015-05-09 13:35:00,2015-05-09 13:31:00,240
153,BUNGAY FIRE BRIGADE,2015-05-23,Cooking fire involving the contents of a cooki...,2015-05-23 07:26:00,2015-05-23 07:07:00,1140
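
An alternative to `astype` for the date conversion is `pd.to_datetime`, which accepts an explicit format string. The sketch below assumes the month/day/year, 12-hour layout seen in the raw data and recomputes the response time for the first two loss incidents (`times_df` is a hypothetical two-row extract):

```python
import pandas as pd

# Two alarm/arrival pairs copied from the rows shown above
times_df = pd.DataFrame({
    "Alarm Date and Time": ["09/04/2015 03:49:00 PM", "02/06/2015 03:23:00 AM"],
    "Arrival Date and Time": ["09/04/2015 03:50:00 PM", "02/06/2015 03:29:00 AM"],
})

# Assumed layout: month/day/year, 12-hour clock with AM/PM
date_format = "%m/%d/%Y %I:%M:%S %p"
for column in ["Alarm Date and Time", "Arrival Date and Time"]:
    times_df[column] = pd.to_datetime(times_df[column], format=date_format)

# Subtracting two datetime columns gives a timedelta column
elapsed = times_df["Arrival Date and Time"] - times_df["Alarm Date and Time"]

# .dt.total_seconds() converts each timedelta to a number of seconds
times_df["Response Time (seconds)"] = elapsed.dt.total_seconds().astype(int)
print(times_df)
```

Stating the format explicitly avoids any ambiguity about whether `02/06/2015` means February 6 or June 2.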


# ==========================================

# Break (10 mins)

# ==========================================

### 2.06 Instructor Do: Pandas Grouping (10 min)

In [153]:
# Import the Pandas library
import pandas as pd
from pathlib import Path

In [154]:
# Create a reference to the CSV file
csv_path = Path("06-Ins_GroupBy/Solved/Resources/CT_fires_2015.csv")

# Read the CSV into a Pandas DataFrame
fires_df = pd.read_csv(csv_path, low_memory=False)

# Print the first five rows of data to the screen
fires_df.head()

Unnamed: 0,Reporting Year,Fire Department Header Key,Fire Department Name,Incident date,Incident Number,Exposure Number,Fire Department Station,Incident Type Code,Incident Type,Aid Given or Received Code,...,Mized Use Property Codes,Mixed Use Property,Property Use Code,Property Use,Incident Street Address,Incident Apartment Number,Incident City,Incident Zip Code,Census Tract,Location
0,2015,CT06240,WATERBURY FIRE DEPARTMENT,05/11/2015,6499.0,0,7.0,440,"Electrical wiring/equipment problem, other.",N,...,,,962.0,"Residential street, road, or residential drive...",GREENLEAF AVE,,WATERBURY,6705,,GREENLEAF AVE06705
1,2015,CT08180,THE TOWN OF SALEM,12/01/2015,24036.0,0,21.0,111,Building fire. Excludes confined fires (113–118).,3,...,,,,,Route 82,,Salem,6420,,Route 8206420
2,2015,CT08190,STONINGTON FIRE DEPT,09/07/2015,76.0,0,1.0,400,"Hazardous condition (no fire), other.",N,...,,,0.0,"Property use, other.",Bayview AVE,,Stonington Borough,6378,,Bayview AVE06378
3,2015,CT01012,RIVERTON VOLUNTEER FIRE DEPT,01/02/2015,1.0,0,1.0,444,Power line down. Excludes people trapped by do...,3,...,,,,,Wallens Hill1 RD,,Riverton,6065,,Wallens Hill1 RD06065
4,2015,CT01070,GOSHEN VOLUNTEER FIRE COMPANY,11/03/2015,37.0,0,,150,"Outside rubbish fire, other.",N,...,,,936.0,"Vacant lot. Undeveloped land, not paved, may i...",E east street south,,Goshen,6756,,E east street south06756


In [155]:
# Rename the mistyped column "Propery Loss"
fires_df = fires_df.rename(columns={"Propery Loss": "Property Loss"})

# Reduce to columns: Fire Department Name, Incident date, Incident Type Code, Incident Type,
# Alarm Date and Time, Arrival Date and Time, Last Unit Cleared Date and Time, 
# Property Loss, Contents Loss, Fire Service Deaths, Fire Service Injuries, 
# Other Fire Deaths, Other Fire Injuries, Incident City, Incident Zip Code

fires_reduced = fires_df[["Fire Department Name", "Incident date", "Incident Type Code",
                          "Incident Type", "Alarm Date and Time", "Arrival Date and Time", 
                          "Last Unit Cleared Date and Time", "Property Loss", "Contents Loss", 
                          "Fire Service Deaths", "Fire Service Injuries", "Other Fire Deaths", 
                          "Other Fire Injuries", "Incident City", "Incident Zip Code"]]
fires_reduced.head()

Unnamed: 0,Fire Department Name,Incident date,Incident Type Code,Incident Type,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,Property Loss,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Incident City,Incident Zip Code
0,WATERBURY FIRE DEPARTMENT,05/11/2015,440,"Electrical wiring/equipment problem, other.",05/11/2015 03:04:00 PM,05/11/2015 03:12:00 PM,05/11/2015 03:18:00 PM,,,0.0,0.0,,,WATERBURY,6705
1,THE TOWN OF SALEM,12/01/2015,111,Building fire. Excludes confined fires (113–118).,12/01/2015 09:14:00 AM,12/01/2015 09:19:00 AM,12/01/2015 09:42:00 AM,,,0.0,0.0,,,Salem,6420
2,STONINGTON FIRE DEPT,09/07/2015,400,"Hazardous condition (no fire), other.",09/07/2015 05:50:00 PM,09/07/2015 05:54:00 PM,09/07/2015 06:30:00 PM,,,0.0,0.0,,,Stonington Borough,6378
3,RIVERTON VOLUNTEER FIRE DEPT,01/02/2015,444,Power line down. Excludes people trapped by do...,01/02/2015 12:06:00 PM,01/02/2015 12:13:00 PM,01/02/2015 01:58:00 PM,,,0.0,0.0,,,Riverton,6065
4,GOSHEN VOLUNTEER FIRE COMPANY,11/03/2015,150,"Outside rubbish fire, other.",11/03/2015 03:06:00 AM,11/03/2015 03:10:00 AM,11/03/2015 03:45:00 AM,0.0,0.0,0.0,0.0,,,Goshen,6756


In [156]:
# The columns we selected with missing data are all numbers, so we can fill NAs with 0
fires_cleaned_df = fires_reduced.fillna(0)
fires_cleaned_df.count()

Fire Department Name               34125
Incident date                      34125
Incident Type Code                 34125
Incident Type                      34125
Alarm Date and Time                34125
Arrival Date and Time              34125
Last Unit Cleared Date and Time    34125
Property Loss                      34125
Contents Loss                      34125
Fire Service Deaths                34125
Fire Service Injuries              34125
Other Fire Deaths                  34125
Other Fire Injuries                34125
Incident City                      34125
Incident Zip Code                  34125
dtype: int64

In [157]:
fires_cleaned_df.head()

Unnamed: 0,Fire Department Name,Incident date,Incident Type Code,Incident Type,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,Property Loss,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Incident City,Incident Zip Code
0,WATERBURY FIRE DEPARTMENT,05/11/2015,440,"Electrical wiring/equipment problem, other.",05/11/2015 03:04:00 PM,05/11/2015 03:12:00 PM,05/11/2015 03:18:00 PM,0.0,0.0,0.0,0.0,0.0,0.0,WATERBURY,6705
1,THE TOWN OF SALEM,12/01/2015,111,Building fire. Excludes confined fires (113–118).,12/01/2015 09:14:00 AM,12/01/2015 09:19:00 AM,12/01/2015 09:42:00 AM,0.0,0.0,0.0,0.0,0.0,0.0,Salem,6420
2,STONINGTON FIRE DEPT,09/07/2015,400,"Hazardous condition (no fire), other.",09/07/2015 05:50:00 PM,09/07/2015 05:54:00 PM,09/07/2015 06:30:00 PM,0.0,0.0,0.0,0.0,0.0,0.0,Stonington Borough,6378
3,RIVERTON VOLUNTEER FIRE DEPT,01/02/2015,444,Power line down. Excludes people trapped by do...,01/02/2015 12:06:00 PM,01/02/2015 12:13:00 PM,01/02/2015 01:58:00 PM,0.0,0.0,0.0,0.0,0.0,0.0,Riverton,6065
4,GOSHEN VOLUNTEER FIRE COMPANY,11/03/2015,150,"Outside rubbish fire, other.",11/03/2015 03:06:00 AM,11/03/2015 03:10:00 AM,11/03/2015 03:45:00 AM,0.0,0.0,0.0,0.0,0.0,0.0,Goshen,6756


In [158]:
# Convert relevant date columns to datetime
# (the unit is spelled out because newer pandas versions reject the unit-less "datetime64")
converted_fires_df = fires_cleaned_df.copy()
converted_fires_df = converted_fires_df.astype({"Incident date": "datetime64[ns]",
    "Alarm Date and Time": "datetime64[ns]",
    "Arrival Date and Time": "datetime64[ns]",
    "Last Unit Cleared Date and Time": "datetime64[ns]"})

converted_fires_df.head()

Unnamed: 0,Fire Department Name,Incident date,Incident Type Code,Incident Type,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,Property Loss,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Incident City,Incident Zip Code
0,WATERBURY FIRE DEPARTMENT,2015-05-11,440,"Electrical wiring/equipment problem, other.",2015-05-11 15:04:00,2015-05-11 15:12:00,2015-05-11 15:18:00,0.0,0.0,0.0,0.0,0.0,0.0,WATERBURY,6705
1,THE TOWN OF SALEM,2015-12-01,111,Building fire. Excludes confined fires (113–118).,2015-12-01 09:14:00,2015-12-01 09:19:00,2015-12-01 09:42:00,0.0,0.0,0.0,0.0,0.0,0.0,Salem,6420
2,STONINGTON FIRE DEPT,2015-09-07,400,"Hazardous condition (no fire), other.",2015-09-07 17:50:00,2015-09-07 17:54:00,2015-09-07 18:30:00,0.0,0.0,0.0,0.0,0.0,0.0,Stonington Borough,6378
3,RIVERTON VOLUNTEER FIRE DEPT,2015-01-02,444,Power line down. Excludes people trapped by do...,2015-01-02 12:06:00,2015-01-02 12:13:00,2015-01-02 13:58:00,0.0,0.0,0.0,0.0,0.0,0.0,Riverton,6065
4,GOSHEN VOLUNTEER FIRE COMPANY,2015-11-03,150,"Outside rubbish fire, other.",2015-11-03 03:06:00,2015-11-03 03:10:00,2015-11-03 03:45:00,0.0,0.0,0.0,0.0,0.0,0.0,Goshen,6756


In [159]:
# Create a new column for "Response Time (seconds)" and calculate
converted_fires_df["Response Time (seconds)"] = converted_fires_df["Arrival Date and Time"] - converted_fires_df["Alarm Date and Time"]
converted_fires_df["Response Time (seconds)"] = converted_fires_df["Response Time (seconds)"].dt.total_seconds()
converted_fires_df["Response Time (seconds)"] = converted_fires_df["Response Time (seconds)"].astype("int")

# Create a new column for "Incident Duration (seconds)" and calculate
converted_fires_df["Incident Duration (seconds)"] = converted_fires_df["Last Unit Cleared Date and Time"] - converted_fires_df["Alarm Date and Time"]
converted_fires_df["Incident Duration (seconds)"] = converted_fires_df["Incident Duration (seconds)"].dt.total_seconds()
converted_fires_df["Incident Duration (seconds)"] = converted_fires_df["Incident Duration (seconds)"].astype("int")

converted_fires_df.head(10)

Unnamed: 0,Fire Department Name,Incident date,Incident Type Code,Incident Type,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,Property Loss,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Incident City,Incident Zip Code,Response Time (seconds),Incident Duration (seconds)
0,WATERBURY FIRE DEPARTMENT,2015-05-11,440,"Electrical wiring/equipment problem, other.",2015-05-11 15:04:00,2015-05-11 15:12:00,2015-05-11 15:18:00,0.0,0.0,0.0,0.0,0.0,0.0,WATERBURY,6705,480,840
1,THE TOWN OF SALEM,2015-12-01,111,Building fire. Excludes confined fires (113–118).,2015-12-01 09:14:00,2015-12-01 09:19:00,2015-12-01 09:42:00,0.0,0.0,0.0,0.0,0.0,0.0,Salem,6420,300,1680
2,STONINGTON FIRE DEPT,2015-09-07,400,"Hazardous condition (no fire), other.",2015-09-07 17:50:00,2015-09-07 17:54:00,2015-09-07 18:30:00,0.0,0.0,0.0,0.0,0.0,0.0,Stonington Borough,6378,240,2400
3,RIVERTON VOLUNTEER FIRE DEPT,2015-01-02,444,Power line down. Excludes people trapped by do...,2015-01-02 12:06:00,2015-01-02 12:13:00,2015-01-02 13:58:00,0.0,0.0,0.0,0.0,0.0,0.0,Riverton,6065,420,6720
4,GOSHEN VOLUNTEER FIRE COMPANY,2015-11-03,150,"Outside rubbish fire, other.",2015-11-03 03:06:00,2015-11-03 03:10:00,2015-11-03 03:45:00,0.0,0.0,0.0,0.0,0.0,0.0,Goshen,6756,240,2340
5,THOMASTON FIRE DEPT,2015-04-04,444,Power line down. Excludes people trapped by do...,2015-04-04 12:37:00,2015-04-04 12:40:00,2015-04-04 14:00:00,0.0,0.0,0.0,0.0,0.0,0.0,Thomaston,6787,180,4980
6,SHARON FIRE DEPARTMENT,2015-05-06,444,Power line down. Excludes people trapped by do...,2015-05-06 16:53:00,2015-05-06 17:00:00,2015-05-06 19:21:00,0.0,0.0,0.0,0.0,0.0,0.0,Sharon,6069,420,8880
7,SHARON FIRE DEPARTMENT,2015-05-07,412,Gas leak (natural gas or LPG). Excludes gas od...,2015-05-07 22:11:00,2015-05-07 22:15:00,2015-05-07 22:46:00,0.0,0.0,0.0,0.0,0.0,0.0,Sharon,6069,240,2100
8,SHARON FIRE DEPARTMENT,2015-09-27,460,"Accident, potential accident, other.",2015-09-27 18:59:00,2015-09-27 19:04:00,2015-09-27 19:27:00,0.0,0.0,0.0,0.0,0.0,0.0,Sharon,6069,300,1680
9,WATERTOWN FIRE DEPT,2015-05-30,445,"Arcing, shorted electrical equipment.",2015-05-30 21:00:00,2015-05-30 21:10:00,2015-05-30 21:31:00,0.0,0.0,0.0,0.0,0.0,0.0,WATERTOWN,6795,600,1860


In [160]:
# Count how many total incidents have occurred within each city
city_counts = converted_fires_df["Incident City"].value_counts()
city_counts.head()

Hartford      1382
Bridgeport    1347
WATERBURY     1235
BRISTOL       1009
MERIDEN        928
Name: Incident City, dtype: int64

In [161]:
# Some cities have a lot of incidents
# Let's filter the data to only incidents of loss
# Build the mask from the same DataFrame that is being filtered
loss_df = converted_fires_df.loc[(converted_fires_df["Property Loss"] > 0) |
                    (converted_fires_df["Contents Loss"] > 0), :]
loss_df.head()

Unnamed: 0,Fire Department Name,Incident date,Incident Type Code,Incident Type,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,Property Loss,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Incident City,Incident Zip Code,Response Time (seconds),Incident Duration (seconds)
24,CITY OF DERBY FIRE DEPT,2015-09-04,100,"Fire, other.",2015-09-04 15:49:00,2015-09-04 15:50:00,2015-09-04 15:59:00,5000.0,0.0,0.0,0.0,0.0,0.0,Derby,6418,60,600
61,BRISTOL FIRE DEPARTMENT,2015-08-27,111,Building fire. Excludes confined fires (113–118).,2015-08-27 15:06:00,2015-08-27 15:09:00,2015-08-27 15:49:00,300.0,150.0,0.0,0.0,0.0,0.0,BRISTOL,6010,180,2580
63,HARTFORD FIRE DEPARTMENT,2015-02-11,150,"Outside rubbish fire, other.",2015-02-11 06:51:00,2015-02-11 06:55:00,2015-02-11 07:01:00,3.0,0.0,0.0,0.0,0.0,0.0,Hartford,6106,240,600
66,TOWN OF MANCHESTER,2015-02-06,100,"Fire, other.",2015-02-06 03:23:00,2015-02-06 03:29:00,2015-02-06 03:57:00,1000.0,500.0,0.0,0.0,0.0,0.0,Manchester,6040,360,2040
69,TOWN OF MANCHESTER,2015-02-02,130,"Mobile property (vehicle) fire, other.",2015-02-02 18:11:00,2015-02-02 18:15:00,2015-02-02 18:25:00,1000.0,0.0,0.0,0.0,0.0,0.0,Manchester,6040,240,840


In [162]:
# Count how many loss incidents have occurred in each city
city_loss_counts = loss_df["Incident City"].value_counts()
city_loss_counts.head()

Hartford       318
WATERBURY      281
New Britain    235
Bridgeport     216
NEW LONDON     120
Name: Incident City, dtype: int64

In [171]:
# Count how many loss incidents occurred in each city
grouped_city_df = loss_df.groupby(["Incident City"])

print(grouped_city_df)

grouped_city_df.count().head(10)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D7BA222348>


Unnamed: 0_level_0,Fire Department Name,Incident date,Incident Type Code,Incident Type,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,Property Loss,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Incident Zip Code,Response Time (seconds),Incident Duration (seconds)
Incident City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
AMSTON,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
ANSONIA,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
AVON,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
Andover,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
Ansonia,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18
BERLIN,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32
BETHEL,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13
BLOOMFIELD,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8
BRANFORD,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21
BRIDGEPORT,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13
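
Notice that the counts above list `ANSONIA` and `Ansonia` as separate cities, because the raw city names are cased inconsistently. Here is a minimal sketch of one way to merge such groups, assuming title case is an acceptable normal form. The small `sample_df` is made-up illustration data, not the fire incident dataset.

```python
import pandas as pd

# Hypothetical sample with the same kind of inconsistent casing
sample_df = pd.DataFrame({
    "Incident City": ["ANSONIA", "Ansonia", "Ansonia", "AVON", "east berlin"],
    "Property Loss": [5000.0, 600.0, 400.0, 14200.0, 25.0],
})

# Strip stray whitespace and convert every name to title case
sample_df["Incident City"] = sample_df["Incident City"].str.strip().str.title()

# Both spellings of Ansonia now fall into a single group
clean_counts = sample_df.groupby("Incident City")["Property Loss"].count()
print(clean_counts)
```

Whether to normalize before grouping depends on the analysis. It is usually worth checking `unique()` on the grouping column first.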


In [175]:
# Get the total property and contents losses.
# Select the two columns before aggregating so that only they are summed.
grouped_city_df[["Property Loss", "Contents Loss"]].sum()

Unnamed: 0_level_0,Property Loss,Contents Loss
Incident City,Unnamed: 1_level_1,Unnamed: 2_level_1
AMSTON,65000.0,5000.0
ANSONIA,5000.0,600.0
AVON,14200.0,1250.0
Andover,2500.0,500.0
Ansonia,644100.0,265100.0
...,...,...
Windham,2000.0,0.0
Windsor Locks,152200.0,19000.0
Woodstock,1000.0,4000.0
east berlin,25.0,0.0


In [177]:
# Save loss sums as series
city_property_loss = grouped_city_df["Property Loss"].sum()
city_contents_loss = grouped_city_df["Contents Loss"].sum()
city_contents_loss.head()

Incident City
AMSTON       5000.0
ANSONIA       600.0
AVON         1250.0
Andover       500.0
Ansonia    265100.0
Name: Contents Loss, dtype: float64

In [178]:
# Create a new DataFrame using count and loss amounts
city_summary_df = pd.DataFrame(
    {
        "Number of Loss Incidents": city_loss_counts,
        "Total Property Loss": city_property_loss,
        "Total Contents Loss": city_contents_loss
    })
city_summary_df.head()

Unnamed: 0,Number of Loss Incidents,Total Property Loss,Total Contents Loss
AMSTON,1,65000.0,5000.0
ANSONIA,2,5000.0,600.0
AVON,6,14200.0,1250.0
Andover,3,2500.0,500.0
Ansonia,18,644100.0,265100.0
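
A summary table like the one above can also be produced in a single step with named aggregation. This is only a sketch on made-up data: `sample_df` and the snake_case output column names are hypothetical, and `"count"` counts non-null values, which matches the row count only when the column has no missing entries.

```python
import pandas as pd

# Hypothetical sample standing in for loss_df
sample_df = pd.DataFrame({
    "Incident City": ["Avon", "Avon", "Andover", "Andover", "Andover"],
    "Property Loss": [8000.0, 1000.0, 2500.0, 0.0, 0.0],
    "Contents Loss": [1200.0, 0.0, 200.0, 300.0, 0.0],
})

# Named aggregation: new_column_name=(source_column, aggregation)
summary_df = sample_df.groupby("Incident City").agg(
    number_of_loss_incidents=("Property Loss", "count"),
    total_property_loss=("Property Loss", "sum"),
    total_contents_loss=("Contents Loss", "sum"),
)
print(summary_df)
```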


In [179]:
# It is also possible to group a DataFrame by multiple columns
# However, the result is indexed by a MultiIndex (one level per grouping column), which can be harder to work with
grouped_city_loss_incidents = loss_df.groupby(["Incident City","Incident Type Code"])

grouped_city_loss_incidents.count().head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Fire Department Name,Incident date,Incident Type,Alarm Date and Time,Arrival Date and Time,Last Unit Cleared Date and Time,Property Loss,Contents Loss,Fire Service Deaths,Fire Service Injuries,Other Fire Deaths,Other Fire Injuries,Incident Zip Code,Response Time (seconds),Incident Duration (seconds)
Incident City,Incident Type Code,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
AMSTON,111,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
ANSONIA,111,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
AVON,111,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
AVON,113,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
AVON,114,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
AVON,131,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Andover,111,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Andover,113,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Ansonia,111,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12
Ansonia,113,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
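
The two-level index shown above can be handled in a few ways. The sketch below uses a small made-up `sample_df` (not the fire data) to show three common options: selecting one group with a tuple, selecting by the outer level only, and flattening the index back into columns.

```python
import pandas as pd

# Hypothetical sample standing in for loss_df
sample_df = pd.DataFrame({
    "Incident City": ["Avon", "Avon", "Avon", "Andover"],
    "Incident Type Code": [111, 111, 113, 111],
    "Property Loss": [5000.0, 3000.0, 0.0, 2500.0],
})

grouped_sums = sample_df.groupby(
    ["Incident City", "Incident Type Code"])[["Property Loss"]].sum()

# Option 1: select one group by passing a tuple of index labels to loc
avon_111_loss = grouped_sums.loc[("Avon", 111), "Property Loss"]

# Option 2: select every row for one city using only the outer index level
avon_rows = grouped_sums.loc["Avon"]

# Option 3: move both index levels back into ordinary columns
flat_df = grouped_sums.reset_index()

print(avon_111_loss)
print(avon_rows)
print(flat_df)
```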


In [180]:
# Aggregating a GroupBy object (here with sum) already returns a DataFrame;
# the pd.DataFrame wrapper just makes the conversion explicit
total_city_loss_df = pd.DataFrame(
    grouped_city_loss_incidents[["Property Loss", "Contents Loss"]].sum())
total_city_loss_df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Property Loss,Contents Loss
Incident City,Incident Type Code,Unnamed: 2_level_1,Unnamed: 3_level_1
AMSTON,111,65000.0,5000.0
ANSONIA,111,5000.0,600.0
AVON,111,8000.0,1200.0
AVON,113,0.0,50.0
AVON,114,1000.0,0.0
AVON,131,5200.0,0.0
Andover,111,2500.0,200.0
Andover,113,0.0,300.0
Ansonia,111,617100.0,263500.0
Ansonia,113,5000.0,500.0


In [181]:
# GroupBy is also useful for situations where you may want to calculate the average
incident_time_df = converted_fires_df[["Incident City","Response Time (seconds)", "Incident Duration (seconds)"]]
incident_time_df.groupby(["Incident City"]).mean().head(10)

Unnamed: 0_level_0,Response Time (seconds),Incident Duration (seconds)
Incident City,Unnamed: 1_level_1,Unnamed: 2_level_1
AGUADILLA,1140.0,5700.0
AMSTON,456.0,2082.0
ANSONIA,288.0,1464.0
ASHFORD,960.0,3510.0
AVON,473.376623,2587.012987
Amenia,630.0,4290.0
Andover,454.285714,2224.285714
Ansonia,308.971963,2553.084112
Ashaway,1200.0,1740.0
Ashford,573.846154,2926.153846


# ==========================================

### 2.07 Partners Do: Exploring U.S. Census Data with GroupBy (20 min)

# Exploring Census Data

In this activity, you will revisit the U.S. Census data and create DataFrames with calculated totals and averages for each state by year.

## Instructions

1. Read in the census CSV file with Pandas.

2. Create two new DataFrames, one to find totals and another to find averages. DataFrames should include:

    * Totals for population, employed civilians, unemployed civilians, people in the military, and poverty count.

    * Averages for median age, household income, and per capita income.

3. Create new DataFrames once the totals and averages have been grouped by each year and state.

4. Rename any columns to reflect the data calculations.

5. Export the resulting tables to CSVs. We will be using them again in our next class.

## References

[U.S. Census API - ACS 5-Year Estimates 2016-2019](https://www.census.gov/data/developers/data-sets/census-microdata-api.ACS_5-Year_PUMS.html)

---

In [182]:
# Dependencies
import pandas as pd
from pathlib import Path

In [183]:
# Save file path to variable
census_csv = Path("07-Par_Census_GroupBy/Solved/Resources/census_data_2016-2019.csv")

# Read with Pandas
census_df = pd.read_csv(census_csv)
census_df.head()

Unnamed: 0,Year,County,State,Population,Median Age,Household Income,Per Capita Income,Employed Civilians,Unemployed Civilians,People in the Military,Poverty Count
0,2016,Autauga County,Alabama,55049,37.8,53099.0,26168.0,24262.0,1437.0,309.0,6697.0
1,2016,Baldwin County,Alabama,199510,42.3,51365.0,28069.0,87753.0,5887.0,232.0,25551.0
2,2016,Barbour County,Alabama,26614,38.7,33956.0,17249.0,8993.0,1323.0,0.0,6235.0
3,2016,Bibb County,Alabama,22572,40.2,39776.0,18988.0,8354.0,643.0,5.0,3390.0
4,2016,Blount County,Alabama,57704,40.8,46212.0,21033.0,21593.0,1367.0,9.0,9441.0


In [184]:
# Create a DataFrame with columns to total: Year, County, State, Population, 
# Employed Civilians, Unemployed Civilians, People in the Military, Poverty Count
census_totals_df = census_df[["Year", "County", "State", "Population", 
                              "Employed Civilians", "Unemployed Civilians",
                              "People in the Military", "Poverty Count"]]
census_totals_df.head()

Unnamed: 0,Year,County,State,Population,Employed Civilians,Unemployed Civilians,People in the Military,Poverty Count
0,2016,Autauga County,Alabama,55049,24262.0,1437.0,309.0,6697.0
1,2016,Baldwin County,Alabama,199510,87753.0,5887.0,232.0,25551.0
2,2016,Barbour County,Alabama,26614,8993.0,1323.0,0.0,6235.0
3,2016,Bibb County,Alabama,22572,8354.0,643.0,5.0,3390.0
4,2016,Blount County,Alabama,57704,21593.0,1367.0,9.0,9441.0


In [185]:
# Create a DataFrame of the totals for each state by year.
census_total_group = census_totals_df.groupby(["Year", "State"])

# numeric_only=True keeps the text column "County" out of the sums
state_totals_df = census_total_group.sum(numeric_only=True)
state_totals_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Population,Employed Civilians,Unemployed Civilians,People in the Military,Poverty Count
Year,State,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2016,Alabama,4841164,2042025.0,184479.0,12150.0,868666.0
2016,Alaska,736855,353954.0,30139.0,16382.0,72826.0
2016,Arizona,6728577,2879372.0,249972.0,17373.0,1165636.0
2016,Arkansas,2968472,1266552.0,93190.0,4445.0,542431.0
2016,California,38654206,17577142.0,1683726.0,130452.0,6004257.0
...,...,...,...,...,...,...
2019,Virginia,8454463,4156018.0,200850.0,120385.0,865691.0
2019,Washington,7404107,3594279.0,187330.0,52871.0,785244.0
2019,West Virginia,1817305,740910.0,51910.0,1306.0,310044.0
2019,Wisconsin,5790716,2982359.0,111564.0,3190.0,639160.0


In [86]:
# Rename columns to make them more understandable
state_totals_df = state_totals_df.rename(columns={"Population": "Total Population",
                                               "Employed Civilians": "Total Employed Civilians",
                                               "Unemployed Civilians": "Total Unemployed Civilians",
                                               "People in the Military": "Total People in the Military",
                                               "Poverty Count": "Total Population in Poverty"})
state_totals_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Total Population,Total Employed Civilians,Total Unemployed Civilians,Total People in the Military,Total Population in Poverty
Year,State,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2016,Alabama,4841164,2042025.0,184479.0,12150.0,868666.0
2016,Alaska,736855,353954.0,30139.0,16382.0,72826.0
2016,Arizona,6728577,2879372.0,249972.0,17373.0,1165636.0
2016,Arkansas,2968472,1266552.0,93190.0,4445.0,542431.0
2016,California,38654206,17577142.0,1683726.0,130452.0,6004257.0
...,...,...,...,...,...,...
2019,Virginia,8454463,4156018.0,200850.0,120385.0,865691.0
2019,Washington,7404107,3594279.0,187330.0,52871.0,785244.0
2019,West Virginia,1817305,740910.0,51910.0,1306.0,310044.0
2019,Wisconsin,5790716,2982359.0,111564.0,3190.0,639160.0


In [87]:
# Create a DataFrame with columns to average: Year, County, State, Median Age, 
# Household Income, Per Capita Income
census_avg_df = census_df[["Year", "County", "State", "Median Age", 
                           "Household Income", "Per Capita Income"]]
census_avg_df.head()

Unnamed: 0,Year,County,State,Median Age,Household Income,Per Capita Income
0,2016,Autauga County,Alabama,37.8,53099.0,26168.0
1,2016,Baldwin County,Alabama,42.3,51365.0,28069.0
2,2016,Barbour County,Alabama,38.7,33956.0,17249.0
3,2016,Bibb County,Alabama,40.2,39776.0,18988.0
4,2016,Blount County,Alabama,40.8,46212.0,21033.0


In [88]:
# Create a DataFrame of the averages for each state by year.
census_avg_group = census_avg_df.groupby(["Year", "State"])

# numeric_only=True skips the text column "County", which cannot be averaged
state_avg_df = census_avg_group.mean(numeric_only=True)
state_avg_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Median Age,Household Income,Per Capita Income
Year,State,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016,Alabama,40.250746,38834.925373,21232.746269
2016,Alaska,36.624138,64801.655172,31052.103448
2016,Arizona,39.613333,44166.533333,21786.333333
2016,Arkansas,41.140000,37503.720000,20591.666667
2016,California,39.281034,58091.241379,29025.793103
...,...,...,...,...
2019,Virginia,42.136090,60756.736842,31250.180451
2019,Washington,42.146154,59393.461538,30986.256410
2019,West Virginia,44.365455,44892.236364,24691.836364
2019,Wisconsin,43.656944,58305.861111,31034.361111


In [89]:
# Rename columns to make them more descriptive.
state_avg_df = state_avg_df.rename(columns={"Median Age": "Average Median Age by County",
                                            "Household Income": "Average Household Income by County",
                                            "Per Capita Income": "Average Per Capita Income by County"})
state_avg_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Average Median Age by County,Average Household Income by County,Average Per Capita Income by County
Year,State,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016,Alabama,40.250746,38834.925373,21232.746269
2016,Alaska,36.624138,64801.655172,31052.103448
2016,Arizona,39.613333,44166.533333,21786.333333
2016,Arkansas,41.140000,37503.720000,20591.666667
2016,California,39.281034,58091.241379,29025.793103
...,...,...,...,...
2019,Virginia,42.136090,60756.736842,31250.180451
2019,Washington,42.146154,59393.461538,30986.256410
2019,West Virginia,44.365455,44892.236364,24691.836364
2019,Wisconsin,43.656944,58305.861111,31034.361111


In [90]:
# Export the DataFrames to CSV
state_totals_df.to_csv("07-Par_Census_GroupBy/Solved/output/state_totals.csv", index=True)
state_avg_df.to_csv("07-Par_Census_GroupBy/Solved/output/state_avg.csv", index=True)
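
Because these tables have a two-level index (`Year`, `State`), reading them back later needs `index_col` to restore it. The following is a minimal round-trip sketch under stated assumptions: `sample_totals_df` is a tiny stand-in for `state_totals_df`, and a temporary directory replaces the real output folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row stand-in for state_totals_df
sample_totals_df = pd.DataFrame(
    {"Total Population": [4841164, 736855]},
    index=pd.MultiIndex.from_tuples(
        [(2016, "Alabama"), (2016, "Alaska")], names=["Year", "State"]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "state_totals.csv"

    # index=True writes both index levels as the first two CSV columns
    sample_totals_df.to_csv(out_path, index=True)

    # index_col rebuilds the (Year, State) MultiIndex when reading back
    reloaded_df = pd.read_csv(out_path, index_col=["Year", "State"])

print(reloaded_df)
```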

# ==========================================

### 2.08 Instructor Do: Sorting Made Easy (10 min)

# Sorting
Data Source: Vermont Agency of Administration, Department of Taxes. Meals and Rooms Tax Statistics (2020 Multiple Periods Update, Calendar Year). [https://tax.vermont.gov/data-and-statistics/mrt](https://tax.vermont.gov/data-and-statistics/mrt)

In [186]:
# Import Dependencies
import pandas as pd
from pathlib import Path

In [187]:
csv_path = Path("08-Ins_Sorting/Solved/Resources/VT_tax_statistics.csv")
taxes_df = pd.read_csv(csv_path, encoding="UTF-8")
taxes_df.head()

Unnamed: 0,Town,Meals,Meals Count,Rent,Rent Count,Alcohol,Alcohol Count,Past Meals,Past Meals count,Past Rent,Past Rent Count,Past Alcohol,Past Alchohol Count
0,ADDISON,0.0,0,90173.1,12,0.0,0,0.0,0,172233.0,15,0.0,0
1,ALBURGH,0.0,0,0.0,0,0.0,0,1077515.13,12,333596.32,19,0.0,0
2,ARLINGTON,871615.85,10,313732.79,10,0.0,0,1653105.07,10,1013081.44,16,0.0,0
3,BARNARD,0.0,0,0.0,0,0.0,0,0.0,0,7474505.3,14,0.0,0
4,BARRE,14101058.17,46,0.0,0,1420668.11,19,16480343.05,49,0.0,0,2809362.08,21


In [191]:
# Sorting the DataFrame based on "Meals" column
# Will sort from lowest to highest if no other parameter is passed
meals_taxes_df = taxes_df.sort_values("Meals")
meals_taxes_df.head()

Unnamed: 0,Town,Meals,Meals Count,Rent,Rent Count,Alcohol,Alcohol Count,Past Meals,Past Meals count,Past Rent,Past Rent Count,Past Alcohol,Past Alchohol Count
0,ADDISON,0.0,0,90173.1,12,0.0,0,0.0,0,172233.0,15,0.0,0
98,WELLS,0.0,0,0.0,0,0.0,0,0.0,0,145041.0,11,0.0,0
35,FAIRLEE,0.0,0,1833212.02,10,0.0,0,2379763.68,11,4475959.53,12,0.0,0
36,FAYSTON,0.0,0,105586.77,11,0.0,0,0.0,0,211939.3,19,0.0,0
37,FERRISBURGH,0.0,0,0.0,0,0.0,0,7025450.58,11,5829011.7,15,0.0,0


In [195]:
# To sort from highest to lowest, ascending=False must be passed in
meals_taxes_df = taxes_df.sort_values("Meals", ascending=False)
meals_taxes_df.head()

Unnamed: 0,Town,Meals,Meals Count,Rent,Rent Count,Alcohol,Alcohol Count,Past Meals,Past Meals count,Past Rent,Past Rent Count,Past Alcohol,Past Alchohol Count
17,BURLINGTON,74507552.54,219,18230026.8,26,18324508.2,122,127618300.0,236,53634054.09,44,44233463.37,129
81,SOUTH BURLINGTON,64445667.13,111,13750969.61,19,4138460.85,40,89535980.0,117,38211751.51,25,10313786.7,44
77,RUTLAND,38005509.1,98,1508769.29,14,2973734.52,38,41993320.0,98,3822279.43,14,5316214.36,38
32,ESSEX,36429036.93,91,0.0,0,2359611.62,29,42033580.0,104,0.0,0,4129281.23,31
12,BRATTLEBORO,33966669.55,102,4868408.74,26,2840765.1,41,41448620.0,100,9867296.43,27,6096085.57,42


In [196]:
# It is possible to sort based upon multiple columns
meals_and_rent_count_df = taxes_df.sort_values(
    ["Meals Count", "Rent Count"], ascending=False)
meals_and_rent_count_df.head(15)

Unnamed: 0,Town,Meals,Meals Count,Rent,Rent Count,Alcohol,Alcohol Count,Past Meals,Past Meals count,Past Rent,Past Rent Count,Past Alcohol,Past Alchohol Count
17,BURLINGTON,74507552.54,219,18230026.8,26,18324508.2,122,127618300.0,236,53634054.09,44,44233463.37,129
81,SOUTH BURLINGTON,64445667.13,111,13750969.61,19,4138460.85,40,89535980.0,117,38211751.51,25,10313786.7,44
12,BRATTLEBORO,33966669.55,102,4868408.74,26,2840765.1,41,41448620.0,100,9867296.43,27,6096085.57,42
77,RUTLAND,38005509.1,98,1508769.29,14,2973734.52,38,41993320.0,98,3822279.43,14,5316214.36,38
32,ESSEX,36429036.93,91,0.0,0,2359611.62,29,42033580.0,104,0.0,0,4129281.23,31
7,BENNINGTON,26317917.62,81,3296492.96,23,2225916.88,32,32141520.0,94,7243933.44,27,4199857.36,33
87,STOWE,33678629.46,80,40772303.07,96,10993675.86,54,52189090.0,84,67794549.41,156,18101140.22,58
55,MANCHESTER,21537627.26,65,13410916.83,41,4124721.26,40,30845790.0,69,28037091.09,59,7650316.61,40
59,MONTPELIER,15480173.01,61,0.0,0,1893772.3,28,25917480.0,66,3458227.45,17,4959620.16,29
102,WILLISTON,27712613.17,59,0.0,0,2190208.8,20,39769500.0,58,0.0,0,4164070.87,20


In [197]:
# To see the sorting by multiple columns better, we can compare the last 
# DataFrame with a second column sort on "Alcohol Count"
# (Compare the order of rows that tie on the first sort column, such as the two "54" rows)
meals_and_alcohol_count_df = taxes_df.sort_values(
    ["Meals Count", "Alcohol Count"], ascending=False)
meals_and_alcohol_count_df.head(15)

Unnamed: 0,Town,Meals,Meals Count,Rent,Rent Count,Alcohol,Alcohol Count,Past Meals,Past Meals count,Past Rent,Past Rent Count,Past Alcohol,Past Alchohol Count
17,BURLINGTON,74507552.54,219,18230026.8,26,18324508.2,122,127618300.0,236,53634054.09,44,44233463.37,129
81,SOUTH BURLINGTON,64445667.13,111,13750969.61,19,4138460.85,40,89535980.0,117,38211751.51,25,10313786.7,44
12,BRATTLEBORO,33966669.55,102,4868408.74,26,2840765.1,41,41448620.0,100,9867296.43,27,6096085.57,42
77,RUTLAND,38005509.1,98,1508769.29,14,2973734.52,38,41993320.0,98,3822279.43,14,5316214.36,38
32,ESSEX,36429036.93,91,0.0,0,2359611.62,29,42033580.0,104,0.0,0,4129281.23,31
7,BENNINGTON,26317917.62,81,3296492.96,23,2225916.88,32,32141520.0,94,7243933.44,27,4199857.36,33
87,STOWE,33678629.46,80,40772303.07,96,10993675.86,54,52189090.0,84,67794549.41,156,18101140.22,58
55,MANCHESTER,21537627.26,65,13410916.83,41,4124721.26,40,30845790.0,69,28037091.09,59,7650316.61,40
59,MONTPELIER,15480173.01,61,0.0,0,1893772.3,28,25917480.0,66,3458227.45,17,4959620.16,29
102,WILLISTON,27712613.17,59,0.0,0,2190208.8,20,39769500.0,58,0.0,0,4164070.87,20
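
The second sort column only matters for rows that tie on the first one, which is hard to see in the top rows above. This sketch uses a made-up `sample_df` with deliberate ties to show the effect. It also shows that `ascending` accepts one value per column.

```python
import pandas as pd

# Hypothetical sample in which three towns tie on "Meals Count"
sample_df = pd.DataFrame({
    "Town": ["A", "B", "C", "D"],
    "Meals Count": [54, 54, 80, 54],
    "Rent Count": [10, 30, 5, 20],
    "Alcohol Count": [9, 2, 7, 4],
})

# Ties on "Meals Count" are broken by "Rent Count", highest first
by_rent = sample_df.sort_values(["Meals Count", "Rent Count"], ascending=False)

# Same first column, but ties are now broken by "Alcohol Count"
by_alcohol = sample_df.sort_values(["Meals Count", "Alcohol Count"], ascending=False)

# A list gives each column its own direction: counts descending, names ascending
mixed = sample_df.sort_values(["Meals Count", "Town"], ascending=[False, True])

print(list(by_rent["Town"]))
print(list(by_alcohol["Town"]))
print(list(mixed["Town"]))
```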


In [199]:
# The index can be reset to provide index numbers based on the new rankings.
new_index_df = meals_and_alcohol_count_df.reset_index(drop=True)
new_index_df.head()

Unnamed: 0,Town,Meals,Meals Count,Rent,Rent Count,Alcohol,Alcohol Count,Past Meals,Past Meals count,Past Rent,Past Rent Count,Past Alcohol,Past Alchohol Count
0,BURLINGTON,74507552.54,219,18230026.8,26,18324508.2,122,127618300.0,236,53634054.09,44,44233463.37,129
1,SOUTH BURLINGTON,64445667.13,111,13750969.61,19,4138460.85,40,89535980.0,117,38211751.51,25,10313786.7,44
2,BRATTLEBORO,33966669.55,102,4868408.74,26,2840765.1,41,41448620.0,100,9867296.43,27,6096085.57,42
3,RUTLAND,38005509.1,98,1508769.29,14,2973734.52,38,41993320.0,98,3822279.43,14,5316214.36,38
4,ESSEX,36429036.93,91,0.0,0,2359611.62,29,42033580.0,104,0.0,0,4129281.23,31


# ==========================================

### 2.09 Students Do: Search for the Worst (20 min)

# Search For The Worst

In this activity, you will take a dataset on San Francisco Airport's utility consumption and determine which month in the dataset had the highest consumption for each utility.

## Instructions

* Read in the CSV file provided, and print it to the screen.

* Print out a list of all the values within the "Utility" column.

* Select a value from this list, and create a new DataFrame that only includes that utility. Note that some utilities have more than one option for "Owner", so you should limit this new DataFrame to a single "Owner" such as "Tenant".

* Sort the DataFrame based on the level of consumption, from most to least.

* Reset the index for the DataFrame so that the index is in order.

* Print out the details of the worst month to the screen.

## References

[SFO Airport Monthly Utility Consumption for Natural Gas, Water, and Electricity](https://data.sfgov.org/Energy-and-Environment/SFO-Airport-Monthly-Utility-Consumption-for-Natura/gcjv-3mzf).

---

In [98]:
# Import Dependencies
import pandas as pd
from pathlib import Path

In [99]:
# Create reference to CSV file
csv_path = Path("09-Stu_SearchForTheWorst/Solved/Resources/SFO_Airport_Utility_Consumption.csv")

# Import the CSV into a pandas DataFrame
consumption_df = pd.read_csv(csv_path)
consumption_df

Unnamed: 0,Year,Month Number,Month,Utility,Owner,Units,Usage
0,2013,1,Jan,Passengers,Campus,PAX,3.209356e+06
1,2013,1,Jan,Gas,Commission,Therms,3.632050e+05
2,2013,1,Jan,Gas,Tenant,Therms,4.939300e+04
3,2013,1,Jan,Electricity,Commission,kWh,1.290435e+07
4,2013,1,Jan,Electricity,Tenant,kWh,1.400216e+07
...,...,...,...,...,...,...,...
555,2019,8,Aug,Gas,Tenant,Therms,6.160002e+03
556,2019,8,Aug,Electricity,Commission,kWh,1.563947e+07
557,2019,8,Aug,Electricity,Tenant,kWh,1.251209e+07
558,2019,8,Aug,Water,Commission,Million Gallons,2.767076e+01


In [100]:
# Collect a list of all the unique values in "Utility"
consumption_df["Utility"].unique()

array(['Passengers', 'Gas', 'Electricity', 'Water'], dtype=object)

In [101]:
# Looking only at Electricity Consumption with "Tenant" owner
electricity_df = consumption_df.loc[(consumption_df["Utility"] == "Electricity") &
                                    (consumption_df["Owner"] == "Tenant"), :]
electricity_df.head()

Unnamed: 0,Year,Month Number,Month,Utility,Owner,Units,Usage
4,2013,1,Jan,Electricity,Tenant,kWh,14002156.0
11,2013,2,Feb,Electricity,Tenant,kWh,12631776.0
18,2013,3,Mar,Electricity,Tenant,kWh,13894596.0
25,2013,4,Apr,Electricity,Tenant,kWh,13548377.0
32,2013,5,May,Electricity,Tenant,kWh,13646478.0


In [102]:
# Sort the DataFrame by the values in the "Usage" column to find the worst month
electricity_df = electricity_df.sort_values(by="Usage", ascending=False)

# Reset the index so that the index is now based on the sorting locations
electricity_df = electricity_df.reset_index(drop=True)

electricity_df.head()

Unnamed: 0,Year,Month Number,Month,Utility,Owner,Units,Usage
0,2015,8,Aug,Electricity,Tenant,kWh,14248996.0
1,2013,7,Jul,Electricity,Tenant,kWh,14213208.26
2,2013,1,Jan,Electricity,Tenant,kWh,14002156.0
3,2014,8,Aug,Electricity,Tenant,kWh,13953716.0
4,2013,10,Oct,Electricity,Tenant,kWh,13933761.0


In [103]:
# Save all of the information collected on the worst month
worst_month = electricity_df.iloc[0, :]
worst_month

Year                   2015
Month Number              8
Month                   Aug
Utility         Electricity
Owner                Tenant
Units                   kWh
Usage            14248996.0
Name: 0, dtype: object
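
Sorting and then taking row 0 works well. As an alternative sketch, `idxmax` and `nlargest` can find the highest-usage row without sorting the whole DataFrame first. The three-row `sample_df` below is a small illustration sample, not the full SFO dataset.

```python
import pandas as pd

# Small illustration sample with the same column names as the activity
sample_df = pd.DataFrame({
    "Year": [2013, 2013, 2015],
    "Month": ["Jan", "Jul", "Aug"],
    "Usage": [14002156.0, 14213208.26, 14248996.0],
})

# idxmax returns the index label of the largest value; loc fetches that row
worst_row = sample_df.loc[sample_df["Usage"].idxmax()]

# nlargest returns the top n rows, ordered from highest to lowest
top_two = sample_df.nlargest(2, "Usage")

print(worst_row)
print(top_two)
```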

# ==========================================

### Rating Class Objectives

* Rate your understanding of each objective on a scale of 1 to 5.

In [None]:
title = "04.2-Data-Analysis-Exploring Pandas"
objectives = [
    "Navigate through DataFrames using Loc and Iloc",
    "Filter and slice Pandas DataFrames",
    "Create and access Pandas GroupBy objects",
    "Understand how to sort DataFrames",
]
rating = []
total = 0
# Ask for a 1-5 rating per objective and keep a running total
for objective in objectives:
    rate = input(objective + "? ")
    total += int(rate)
    rating.append(f"{objective}. ({rate}/5)")
print("=" * 96)
print(f"Self Evaluation for: {title}")
print("-" * 24)
for line in rating:
    print(line)
print("-" * 64)
print(f"Average: {total / len(objectives)}")