# Comparing rainfall across cities in the continental U.S.A.
##### Author: John Mays
## Setup:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
# the code from this notebook has been moved to this .py file:
import city_comparison_helpers as cch

The question here is: which cities should I consider?  The more cities I consider, the more data I have to download.  If I downloaded the **GSOY** (Global Summary of the Year) CSV for every one of the 83,000 or so weather stations that are listed, I'd have to download ~16GB of CSVs (a rough and likely low estimate).  I could bring in some big data technologies to handle that, but I don't want to. So, should I...
- get a list of cities commonly considered rainy and find one weather station per city?
- get one weather station for every "big" city in the U.S.?
- get all of the weather stations in a few states?

Although it's not perfect, I will likely go with the first option.
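
As a rough check on that download estimate, the arithmetic can be sketched as below. It assumes 1 GB = 10^9 bytes and that the CSVs are about the same size on average, neither of which is stated above; the 24 is just the length of the rainy-cities list that follows.

```python
# Back-of-the-envelope download size, using the figures quoted above.
TOTAL_BYTES = 16e9     # ~16GB for every station (assumes 1 GB = 10**9 bytes)
N_STATIONS = 83_000    # approximate number of listed weather stations

# Average size of one GSOY CSV (assumes files are similar in size).
bytes_per_station = TOTAL_BYTES / N_STATIONS

n_selected = 24        # one station per city on the rainy-cities list
selected_mb = n_selected * bytes_per_station / 1e6

print(f"~{bytes_per_station / 1e3:.0f} KB per station")
print(f"~{selected_mb:.1f} MB for {n_selected} stations")
```

Under those assumptions the first option costs a few megabytes rather than gigabytes, which is why it is the practical choice here.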

##### List of U.S. cities and towns commonly considered rainy:
- Seattle, WA
- Miami, FL
- New Orleans, LA
- Birmingham, AL
- Tampa, FL
- NY, NY
- White Plains, NY
- Brookhaven, NY
- Syracuse, NY
- Buffalo, NY
- Erie, PA
- San Francisco, CA
- Rochester, NY
- Cleveland, OH
- Akron, OH
- Pittsburgh, PA
- Portland, OR
- Salem, OR
- Houston, TX
- Ft. Lauderdale, FL
- Tallahassee, FL
- Aberdeen, WA
- Detroit, MI
- Forks, WA

**ANYWHERE BUT AKRON, OH!!:**

To be honest, I am mostly interested in living north of the Mason-Dixon line, and not particularly interested in living in the "middle of nowhere", so I'll take the cities I am not interested in off of the list.

I'll also add a few cities I'm just curious about to come up with this list:

In [2]:
# range of years I'm considering:
rng = (2011, 2021) # (inclusive)

In [3]:
cities_stations = {
    "Seattle": "USW00094290",
    "Miami": "USW00012839", # USW00092811
    "New York City": "USW00094728",
    "White Plains": "USW00094745", # technically Westchester County, NY
    "Brookhaven": "USW00054790",
    "Syracuse": "USW00014771",
    "Buffalo": "USW00014733",
    "Erie": "USW00014860",
    "Pittsburgh": "USW00014762",
    "San Francisco": "USW00023272",
    "West Lafayette": "USC00129430",
    "Cleveland": "USW00014820",
    "Portland": "USC00356750",
    "Detroit": "CA006139520", # USW00014822, # USC00202016
    "Forks": "USC00452914",
    "Boston": "USW00014739",
    "Madison": "USC00470273",
    "Woods Hole": "US1MABA0003", # US1MABA0013 | technically Falmouth, MA
    "Monterey": "USW00023259",
    "Richland": "USW00024163", # USC00457015
    "Cincinnati": "USW00093812",
    "Rochester": "USC00309049", # USW0001476
    "Astoria": "USW00094224", # (Oregon)
    "Annette": "USW00025325", # technically Ketchikan, AK
    "Columbus": "USW00014821",
    "New Orleans": "USW00012916",
    "Houston": "USW00012960",
    "Port Arthur": "USW00012917", # (Texas)
    "Lake Jackson": "USW00012976",
    "Sequim": "USC00457544", # (Washington)
    "Spokane": "USW00024157",
    "Quillayute": "USW00094240",
    "Aberdeen": "USC00450008",
    "Hoquiam": "USW00094225",
    "Conneaut": "USW00004857", # (Ohio)
    "Driggs": "USC00480140", # USC00102676 | technically Alta, Wyoming
    "Norton": "USC00449215", # technically Wise, Virginia
    # NO COMPLETE DATA IN VICINITY:
    # "Seaside": "USC00357641", # (Oregon) 
    "Dunkirk": "USW00014747", # US1NYCQ0009 | (New York)
    "West Palm": "USW00012844", # (Florida)
    "Presque Isle": "USW00014607", # technically Caribou, Maine
    "Brassau Dam": "USC00170814" # (Maine)
}

In [4]:
cch.check_years(cities_stations, rng=rng)
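
The body of `cch.check_years` is not shown in this notebook. As a minimal sketch of the kind of check it presumably performs, the hypothetical helper below (`find_missing_years` is my name, not a function in `cch`) reports which years of an inclusive range have no row in a station's table. It assumes `DATE` holds the year as an integer, as in the schema further down.

```python
import pandas as pd

def find_missing_years(station_df, rng):
    """Return the years in the inclusive range `rng` with no row in `station_df`.

    Hypothetical helper for illustration; not the actual cch.check_years.
    """
    first, last = rng
    expected = set(range(first, last + 1))        # inclusive on both ends
    present = set(station_df["DATE"].astype(int))
    return sorted(expected - present)

# Toy station table with 2013 missing (values are made up).
toy = pd.DataFrame(
    {"DATE": [2011, 2012, 2014, 2015], "PRCP": [900.0, 1010.5, 870.2, 990.1]}
)
missing = find_missing_years(toy, (2011, 2015))
print(missing)
```

A station that returns a non-empty list here would be a candidate for swapping in one of the alternate station IDs kept in the comments above.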

In [5]:
partial_schema = pd.DataFrame(
    {
        "Column" : [
            "DATE",
            "PRCP",
            "SNOW",
            "DYFG",
            "DYHF",
            "DYTS",
            "EMXP",
            "DP01",
            "DP10",
            "DP1X",
            "DSNW"
        ],
        "Description" : [
            "year AD",
            "total annual precipitation in mm",
            "total annual snowfall in mm",
            "number of days with fog",
            "number of days with 'heavy' fog",
            "number of days with thunderstorms",
            "highest daily total of precipitation in mm",
            # adjacent string literals are joined, so no stray indentation
            # ends up inside the descriptions
            "number of days with at least 0.01 inches of unspecified "
            "precipitation, probably rain w/o snow",
            "number of days with at least 0.10 inches of unspecified "
            "precipitation, probably rain w/o snow",
            "number of days with at least 1.00 inches of unspecified "
            "precipitation, probably rain w/o snow",
            "number of days with at least 1.00 inches of snowfall"
        ]
    }
)
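
The ranked tables later in this notebook are labeled in mm, while the day-count thresholds (DP01, DP10, DP1X, DSNW) are defined in inches. For moving between the two, a small conversion helper is sketched below; it is my own addition rather than part of `cch`, and the only fact it relies on is that 1 inch is exactly 25.4 mm.

```python
MM_PER_INCH = 25.4  # exact by definition

def mm_to_inches(mm):
    """Convert a length in millimetres to inches."""
    return mm / MM_PER_INCH

def inches_to_mm(inches):
    """Convert a length in inches to millimetres."""
    return inches * MM_PER_INCH

# Annette's median annual precipitation from the table below, in inches.
annette_inches = mm_to_inches(3826.2)
print(round(annette_inches, 1))

# The DP01 threshold (0.01 inches) expressed in mm.
dp01_threshold_mm = inches_to_mm(0.01)
print(round(dp01_threshold_mm, 3))
```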

In [6]:
columns_selection = list(partial_schema["Column"])

In [7]:
cities = cch.total_concat(cities_stations, columns_selection, rng=rng)
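
`cch.total_concat` lives in the helper module, so its body is not visible here. The sketch below shows one plausible way to build a table like `cities`, assuming each station has already been loaded into its own DataFrame. The function name `concat_stations` and the toy frames are hypothetical, not the author's implementation.

```python
import pandas as pd

def concat_stations(frames_by_city, columns, rng):
    """Stack per-city tables into one long table with a CITY column.

    Hypothetical stand-in for cch.total_concat, for illustration only.
    """
    first, last = rng
    pieces = []
    for city, df in frames_by_city.items():
        # Keep only the years inside the inclusive range.
        piece = df[df["DATE"].between(first, last)].copy()
        piece.insert(0, "CITY", city)
        pieces.append(piece)
    combined = pd.concat(pieces, ignore_index=True)
    # Keep the requested columns; anything a station lacks shows up as NaN.
    return combined.reindex(columns=["CITY"] + list(columns))

# Toy inputs (made-up values): station B reports no SNOW column at all.
frames = {
    "A": pd.DataFrame({"DATE": [2010, 2011, 2012],
                       "PRCP": [1.0, 2.0, 3.0],
                       "SNOW": [0.0, 5.0, 6.0]}),
    "B": pd.DataFrame({"DATE": [2011, 2012], "PRCP": [4.0, 5.0]}),
}
combined = concat_stations(frames, ["DATE", "PRCP", "SNOW"], (2011, 2012))
print(combined)
```

If the real helper behaves similarly, a station that never reports a column would end up as all-NaN for that column, which is one possible reading of the NaN rows in the snow ranking below.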

## Looking at statistics:
### First, mean and median over an 11-year time period (2011-2021)

In [8]:
grouped_median = cities.groupby('CITY').median()
grouped_mean = cities.groupby('CITY').mean()
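
Computing both statistics is worthwhile because, with only 11 values per city, a single extreme year can move the mean a lot while barely touching the median. A toy illustration with made-up numbers (cities "A" and "B" are placeholders, not stations from this notebook):

```python
import pandas as pd

# Five made-up annual totals per city; city A has one very wet year.
toy = pd.DataFrame(
    {
        "CITY": ["A"] * 5 + ["B"] * 5,
        "PRCP": [1000, 1010, 990, 1005, 1900,
                 1100, 1120, 1090, 1110, 1105],
    }
)

summary = toy.groupby("CITY")["PRCP"].agg(["mean", "median"])
print(summary)

# The two statistics disagree about which city is wetter.
wettest_by_mean = summary["mean"].idxmax()
wettest_by_median = summary["median"].idxmax()
print(wettest_by_mean, wettest_by_median)
```

Here A ranks first by mean because of its outlier year, while B ranks first by median, so the choice of statistic can change a ranking.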

### Looking at some ranked lists and numbers:

#### Two ways of measuring precipitation
1. Annual rainfall sum
2. Number of days with (any amount of) rain per year

In [9]:
cch.series_comparison(grouped_median["PRCP"].sort_values(ascending=False),
                      grouped_median["DP01"].sort_values(ascending=False),
                      titles=["Median rainfall per year (mm)", "Number of days with any rainfall"])

Median rainfall per year (mm) | Number of days with any rainfall
CITY                          | CITY
Annette           3826.20     | Annette           237.0
Forks             3090.30     | Forks             231.0
Quillayute        2727.50     | Quillayute        200.0
Aberdeen          2310.50     | Astoria           199.0
Astoria           1851.90     | Aberdeen          195.0
Hoquiam           1829.80     | Hoquiam           185.0
New Orleans       1734.50     | Rochester         179.5
Miami             1676.00     | Syracuse          177.0
Port Arthur       1669.20     | Seattle           176.0
West Palm         1502.70     | Erie              175.0
Norton            1405.80     | Buffalo           167.0
Houston           1293.00     | Cleveland         163.0
Woods Hole        1199.90     | Presque Isle      163.0
New York City     1177.30     | Brassau Dam       162.0
Lake Jackson      1172.20     | Norton            162.0
Cincinnati        1166.30     | Portland          157.0
Pittsburgh        1122.70     | Pittsburgh        157.0
Col
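
Because the two columns above are sorted independently, a single row does not describe a single city. One way to line the two measures up per city is to rank each and compare the ranks. The sketch below uses only the top six rows of the output above and plain pandas; the column names `prcp_rank`, `days_rank` and `rank_shift` are mine.

```python
import pandas as pd

# Top six rows of the output above (medians for 2011-2021).
top = pd.DataFrame(
    {
        "PRCP": [3826.2, 3090.3, 2727.5, 2310.5, 1851.9, 1829.8],
        "DP01": [237.0, 231.0, 200.0, 195.0, 199.0, 185.0],
    },
    index=pd.Index(
        ["Annette", "Forks", "Quillayute", "Aberdeen", "Astoria", "Hoquiam"],
        name="CITY",
    ),
)

# Rank 1 = wettest by that measure.
ranks = pd.DataFrame(
    {
        "prcp_rank": top["PRCP"].rank(ascending=False).astype(int),
        "days_rank": top["DP01"].rank(ascending=False).astype(int),
    }
)
# Positive shift: the city ranks higher by rainy days than by total rainfall.
ranks["rank_shift"] = ranks["prcp_rank"] - ranks["days_rank"]
print(ranks.sort_values("prcp_rank"))
```

Among these six, only Aberdeen and Astoria trade places: Aberdeen gets more total rain, while Astoria gets rain on more days.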

#### Highest annual snow (mm) (median)

In [10]:
grouped_median["SNOW"].sort_values(ascending=False)

CITY
Presque Isle      3161.0
Driggs            3145.0
Syracuse          2840.0
Brassau Dam       2594.0
Erie              2303.0
Rochester         2266.0
Buffalo           2088.0
Boston            1275.0
Cleveland         1177.0
Spokane            989.0
Detroit            980.0
Madison            956.0
Norton             886.0
New York City      880.0
Columbus           643.0
West Lafayette     489.0
Woods Hole         429.5
Portland           136.0
Richland           117.0
Seattle             64.0
Lake Jackson         0.0
Houston              0.0
Monterey             0.0
Aberdeen             NaN
Annette              NaN
Astoria              NaN
Brookhaven           NaN
Cincinnati           NaN
Conneaut             NaN
Dunkirk              NaN
Forks                NaN
Hoquiam              NaN
Miami                NaN
New Orleans          NaN
Pittsburgh           NaN
Port Arthur          NaN
Quillayute           NaN
San Francisco        NaN
Sequim               NaN
West Palm           

#### Fog vs. Rain (median)

In [11]:
cch.series_comparison(grouped_median["DYFG"].sort_values(ascending=False),
                  grouped_median["DP01"].sort_values(ascending=False),
                  titles = ["Median number of foggy days", "Median number of rainy days"])

Median number of foggy days | Median number of rainy days
CITY                        | CITY
Hoquiam           286.0     | Annette           237.0
Quillayute        270.0     | Forks             231.0
Cincinnati        245.0     | Quillayute        200.0
Astoria           240.0     | Astoria           199.0
Lake Jackson      232.0     | Aberdeen          195.0
Annette           222.0     | Hoquiam           185.0
Port Arthur       213.0     | Rochester         179.5
Monterey          180.0     | Syracuse          177.0
Columbus          176.0     | Seattle           176.0
Presque Isle      171.0     | Erie              175.0
Syracuse          168.0     | Buffalo           167.0
New Orleans       161.0     | Cleveland         163.0
Houston           157.0     | Presque Isle      163.0
Cleveland         156.0     | Brassau Dam       162.0
Buffalo           156.0     | Norton            162.0
Erie              146.0     | Portland          157.0
New York City     142.0     | Pittsburgh   

## Extra Methods:

In [12]:
# cch.check_years(cities_stations, rng=(2011, 2021))

# Issues
- **Weather station may not be representative of entire city:** This matters most for patchy phenomena like fog: you could live near a city yet far enough from its weather station that your experience differs from what the station reports.  For example, if you live in Cleveland a few miles from Lake Erie, you are much less likely to see fog than the weather station close to the lake.
- Also, in an ideal world, I would attempt to take measurements from several weather stations and aggregate them somehow to be more representative of a general area.  Although, since there are quite a few small towns with relatively sparse data available, I'm not so sure that would be smart.
- **Sample size:** I am only sampling 2011-2021 here, 11 years.  I wish I could sample back into the 1970s or so, but too many station records are fragmented to pick a larger range, so 11 years of data will have to do.  If I am curious about comparing a few cities further, I may use Global Summary of the Month (**GSOM**) data to dig deeper.
- **You may be tempted to say a city on here is "high" or "low" compared to others.** However, this dataset was hand-selected and is skewed towards notoriously rainy places, so it wouldn't really be honest to say "West Lafayette gets low rain" when it is only being compared to the upper echelon of rainy places in the USA.
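
The multi-station idea in the second bullet could look roughly like the sketch below: average the stations within each city and year first, then take the median across years. The station labels and fog counts are invented for illustration, and the simple unweighted mean is an assumption, not a recommendation.

```python
import pandas as pd

# Made-up readings: two hypothetical Cleveland stations, one missing a year.
readings = pd.DataFrame(
    {
        "CITY": ["Cleveland"] * 4,
        "STATION": ["lakeside", "lakeside", "inland", "inland"],
        "DATE": [2011, 2012, 2011, 2012],
        "DYFG": [160.0, 150.0, 120.0, None],
    }
)

# Step 1: average across stations within each city and year.
# mean() skips missing values, so 2012 falls back to the one station present.
per_year = readings.groupby(["CITY", "DATE"])["DYFG"].mean()

# Step 2: summarize each city across years, as in the rest of the notebook.
city_level = per_year.groupby(level="CITY").median()
print(per_year)
print(city_level)
```

Note how the 2012 value silently rests on a single station. With sparse small-town data that would happen often, which is the concern raised in that bullet.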