# Comparing rainfall across cities in the continental U.S.A.
##### Author: John Mays

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
# the code from this notebook has been moved to this .py file:
from city_comparison_helpers import *

The question here is: which cities should I consider? The more cities I include, the more data I have to download. If I downloaded the **GSOY** (Global Summary of the Year) CSV for every one of the 83,000 or so weather stations that are listed, I'd have to download ~16GB of CSVs (a rough and likely low estimate). I could bring in some big data technologies to handle that, but I don't want to. So, should I...
- get a list of cities commonly considered rainy and find one weather station per city?
- get one weather station for every "big" city in the U.S.?
- get all of the weather stations in a few states?

Although it's not perfect, I will likely go with the first option.

## List of U.S. cities and towns commonly considered rainy:
- Seattle, WA
- Miami, FL
- New Orleans, LA
- Birmingham, AL
- Tampa, FL
- NY, NY
- White Plains, NY
- Brookhaven, NY
- Syracuse, NY
- Buffalo, NY
- Erie, PA
- San Francisco, CA
- Rochester, NY
- Cleveland, OH
- Akron, OH
- Pittsburgh, PA
- Portland, OR
- Salem, OR
- Houston, TX
- Ft. Lauderdale, FL
- Tallahassee, FL
- Aberdeen, WA
- Detroit, MI
- Forks, WA

**ANYWHERE BUT AKRON, OH!!:**

To be honest, I am mostly only interested in living North of the Mason-Dixon line, and also not particularly interested in living in the "middle of nowhere" so I'll take cities I am not interested in off of the list.

I'll also add a few cities I'm just curious about to come up with this list:

In [2]:
cities_stations = {
    "Seattle": "USW00094290",
    "Miami": "USW00012839", # USW00092811
    "New York City": "USW00094728",
    "White Plains": "USW00094745", # technically Westchester County
    "Brookhaven": "USW00054790",
    "Syracuse": "USW00014771",
    "Buffalo": "USW00014733",
    "Erie": "USW00014860",
    "Pittsburgh": "USW00014762",
    "San Francisco": "USW00023272",
    "West Lafayette": "USC00129430",
    # "Rochester": "USW0001476",
    "Cleveland": "USW00014820",
    "Portland": "USC00356750",
    "Detroit": "CA006139520", # USW00014822, # USC00202016
    "Forks": "USC00452914",
    "Boston": "USW00014739",
    "Madison": "USC00470273",
    "Woods Hole": "US1MABA0003", # US1MABA0013 technically Falmouth, MA
    "Monterey": "USW00023259",
    "Richland": "USW00024163" # USC00457015
}

In [3]:
partial_schema = pd.DataFrame(
    {
        "Column" : [
            "DATE",
            "PRCP",
            "SNOW",
            "DYFG",
            "DYHF",
            "DYTS",
            "EMXP",
            "DP01",
            "DP10",
            "DP1X",
            "DSNW"
        ],
        "Description" : [
            "year AD",
            # PRCP, SNOW, and EMXP come out in mm in the tables below
            "total annual precipitation in mm",
            "total annual snowfall in mm",
            "number of days with fog",
            "number of days with 'heavy' fog",
            "number of days with thunderstorms",
            "highest daily total of precipitation in mm",
            # adjacent string literals are joined, which avoids the stray
            # indentation a backslash continuation would embed in the text
            "number of days with at least 0.01 inches of unspecified "
            "precipitation, probably rain w/o snow",
            "number of days with at least 0.10 inches of unspecified "
            "precipitation, probably rain w/o snow",
            "number of days with at least 1.00 inches of unspecified "
            "precipitation, probably rain w/o snow",
            "number of days with at least 1.00 inches of snowfall"
        ]
    }
)

In [10]:
columns_selection = list(partial_schema["Column"])

In [4]:
cities = total_concat(cities_stations, columns_selection, rng=(2012,2022))
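`total_concat` lives in `city_comparison_helpers` and is not shown in this notebook. As a rough sketch of the kind of work it presumably does (select the columns, keep the years in `rng`, label each row with its city, stack everything), here is a hypothetical stand-in. The name `concat_city_frames`, its arguments, and the toy numbers are all assumptions for illustration, and it takes already-loaded DataFrames rather than station IDs so that it runs without downloading anything.

```python
import pandas as pd

def concat_city_frames(frames_by_city, columns, rng):
    """Hypothetical stand-in for total_concat: stack per-city yearly tables."""
    first_year, last_year = rng
    pieces = []
    for city, frame in frames_by_city.items():
        # Keep only the requested columns; a column the station never
        # reported comes back filled with NaN instead of raising.
        piece = frame.reindex(columns=columns)
        # DATE holds the year, so an inclusive range filter is enough.
        in_range = piece["DATE"].between(first_year, last_year)
        piece = piece.loc[in_range].copy()
        piece["CITY"] = city
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)

# Made-up yearly tables, only to show the shape of the result.
toy_frames = {
    "Forks": pd.DataFrame(
        {"DATE": [2011, 2012, 2013], "PRCP": [3000.0, 3100.0, 2900.0]}
    ),
    "Richland": pd.DataFrame(
        {"DATE": [2012, 2013], "PRCP": [170.0, 160.0], "SNOW": [100.0, 120.0]}
    ),
}
toy_cities = concat_city_frames(
    toy_frames, ["DATE", "PRCP", "SNOW"], rng=(2012, 2022)
)
print(toy_cities)
```

The result has one row per city per year plus a `CITY` column, which is the shape the `groupby('CITY')` calls below rely on.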

# Looking at statistics:
## First, mean and median over an 11-year period (2012-2022)

In [5]:
grouped_median = cities.groupby('CITY').median()
grouped_mean = cities.groupby('CITY').mean()

### Highest annual precipitation (mm) (median)

In [16]:
grouped_median["PRCP"].sort_values(ascending=False)

CITY
Forks             3090.30
Miami             1707.00
New York City     1175.80
Woods Hole        1171.20
Portland          1115.40
Cleveland         1085.80
Brookhaven        1080.00
Erie              1074.20
Madison           1072.40
Pittsburgh        1066.10
Buffalo           1058.90
Syracuse          1032.90
Boston            1026.90
Seattle           1017.30
Detroit           1016.00
White Plains       998.25
West Lafayette     972.20
San Francisco      577.10
Monterey           389.30
Richland           168.30
Name: PRCP, dtype: float64
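For anyone who thinks about rain in inches, converting these totals is one division by 25.4. A minimal sketch using a few of the medians printed above (the `prcp_mm` Series is retyped by hand here rather than pulled from `grouped_median`):

```python
import pandas as pd

MM_PER_INCH = 25.4

# A handful of the median annual totals from the table above, in mm.
prcp_mm = pd.Series(
    {"Forks": 3090.30, "Miami": 1707.00, "Seattle": 1017.30, "Richland": 168.30},
    name="PRCP",
)

# Convert to inches and round to one decimal place for readability.
prcp_in = (prcp_mm / MM_PER_INCH).round(1)
print(prcp_in)
```

On the real data the same thing would be `grouped_median["PRCP"] / 25.4`.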

### Highest number of days with any rainfall (median)

In [17]:
grouped_median["DP01"].sort_values(ascending=False)

CITY
Forks             231.0
Syracuse          172.0
Erie              168.0
Buffalo           165.0
Seattle           161.0
Cleveland         159.0
Pittsburgh        156.0
Portland          156.0
Miami             148.0
Woods Hole        143.0
Detroit           138.0
Boston            127.0
New York City     127.0
Brookhaven        125.0
White Plains      125.0
Madison           118.0
West Lafayette    118.0
Richland           71.0
San Francisco      67.0
Monterey           59.0
Name: DP01, dtype: float64

### Highest annual snow (mm) (median)

In [12]:
grouped_median["SNOW"].sort_values(ascending=False)

CITY
Syracuse          2736.0
Erie              2225.0
Buffalo           2091.0
Boston            1275.0
Cleveland         1177.0
Madison            989.5
Detroit            970.0
New York City      754.0
West Lafayette     489.0
Woods Hole         443.0
Richland           117.0
Portland           104.0
Seattle             51.0
Miami                0.0
Monterey             0.0
Brookhaven           NaN
Forks                NaN
Pittsburgh           NaN
San Francisco        NaN
White Plains         NaN
Name: SNOW, dtype: float64
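A `NaN` here is not the same as zero snow. The pandas group median skips missing values, so it only comes out `NaN` when a city has no snowfall readings at all in the period. Putting the count of reported years next to the median makes that visible. The sketch below uses a made-up toy frame, not the real `cities` data:

```python
import numpy as np
import pandas as pd

# Toy data: Miami reports zero snow, Forks reports nothing,
# Seattle reports in only one of two years.
toy = pd.DataFrame(
    {
        "CITY": ["Miami", "Miami", "Forks", "Forks", "Seattle", "Seattle"],
        "SNOW": [0.0, 0.0, np.nan, np.nan, 51.0, np.nan],
    }
)

# "count" counts only non-missing values, so 0 means "no data", not "no snow".
snow_summary = (
    toy.groupby("CITY")["SNOW"]
    .agg(["median", "count"])
    .rename(columns={"count": "years_reported"})
)
print(snow_summary)
```

Running the same aggregation on `cities` would show how many of the 11 years each median is actually based on.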

### Highest number of days with fog (mean)

In [13]:
grouped_mean["DYFG"].sort_values(ascending=False)

CITY
Monterey          175.181818
Syracuse          166.090909
Buffalo           146.090909
Erie              142.363636
Cleveland         136.545455
New York City     125.363636
Boston            117.727273
Miami              92.545455
Richland           89.454545
White Plains       79.818182
Pittsburgh         62.363636
Seattle            56.090909
Brookhaven         44.800000
Forks              18.181818
Madison             4.000000
West Lafayette      2.800000
Detroit                  NaN
Portland                 NaN
San Francisco            NaN
Woods Hole               NaN
Name: DYFG, dtype: float64

# Extra Methods:

In [7]:
# check_years(cities_stations, rng=(2011, 2021))
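`check_years` is another helper that is not shown here. As a hedged guess at the kind of check it performs, the sketch below lists which years in a range are missing for each city. The function name `missing_years` and the toy frame are assumptions for illustration, and unlike `check_years` it works on an already-built table with `CITY` and `DATE` columns instead of on station IDs.

```python
import pandas as pd

def missing_years(df, rng):
    """Return {city: [years with no row]} for cities with gaps (hypothetical helper)."""
    first_year, last_year = rng
    expected = set(range(first_year, last_year + 1))
    gaps = {}
    for city, group in df.groupby("CITY"):
        missing = sorted(expected - set(group["DATE"]))
        if missing:
            gaps[city] = missing
    return gaps

# Toy table: Forks has no row for 2013.
toy_years = pd.DataFrame(
    {
        "CITY": ["Erie", "Erie", "Erie", "Forks", "Forks"],
        "DATE": [2012, 2013, 2014, 2012, 2014],
    }
)
year_gaps = missing_years(toy_years, rng=(2012, 2014))
print(year_gaps)
```

Note that a city with no rows at all would not show up in this output, since it never appears as a group.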

# Issues
- **Weather station may not be representative of the entire city:** This matters especially for measurements like days with fog, which is a spotty phenomenon: you could live in or around a city but not close enough to the weather station to see what it reports. For example, in Cleveland, if you live a few miles from Lake Erie, you are much less likely to see fog than the weather station close to the lake.
- Also, in an ideal world, I would take measurements from several weather stations per city and aggregate them to be more representative of the general area. However, since quite a few small towns have relatively sparse data available, I'm not so sure that would be smart.
- **Sample size:** I am only sampling 2012-2022 here, 11 years. I wish I could sample back into the 1970s or so, but too many of the datasets are fragmented to pick a larger range, so 11 years of data will have to do. If I am curious about comparing a few cities further, I may use Global Summary of the Month (**GSOM**) data to dig deeper.
- **You may be tempted to say a city on here is "high" or "low" compared to others.** However, this list was hand-selected and is skewed towards notoriously rainy places, so it wouldn't really be honest to say "West Lafayette gets low rain" when it is only being compared to the upper echelon of rainy places in the USA.
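The multi-station idea from the issues above could be sketched roughly like this: average each measurement across a city's stations per year, and keep track of how many stations actually reported. This notebook does not do this. The `STATION` column, the station labels, and the numbers below are made up for illustration.

```python
import pandas as pd

# Toy readings: two hypothetical Cleveland stations, one of which
# has no fog count for 2013.
readings = pd.DataFrame(
    {
        "CITY": ["Cleveland"] * 4,
        "STATION": ["lakefront", "inland", "lakefront", "inland"],
        "DATE": [2012, 2012, 2013, 2013],
        "DYFG": [140.0, 100.0, 130.0, None],
    }
)

# Mean across stations for each city and year; "count" ignores missing
# values, so it shows how many stations each average rests on.
per_city_year = (
    readings.groupby(["CITY", "DATE"])["DYFG"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "DYFG_mean", "count": "stations_reporting"})
    .reset_index()
)
print(per_city_year)
```

Keeping `stations_reporting` around is what exposes the sparse-data problem: for small towns that number would often be 1, and the "average" would just be the single station again.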