
## Introduction

**Problem**

---
Dr. Egan's family thinks that it rains too much in Seattle, and refuses to visit him. Thus, we want to use data to determine whether it rains more in Seattle, WA than in New York City, NY.

**Source**


---

The original datasets were gathered with the [National Centers for Environmental Information](https://www.ncei.noaa.gov/cdo-web/search?datasetid=GHCND) online search tool. They record daily precipitation levels in Seattle and New York City from January 1, 2020 through December 31, 2023. The two datasets were then cleaned and merged into one in [this Colab notebook](https://raw.githubusercontent.com/jhuang2003/Seattle-Weather/main/seattle_weather_data_processing.ipynb).

## Import libraries

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
sns.set_theme(style='whitegrid')
import missingno as msno

## Load clean data

In [87]:
df = pd.read_csv('https://raw.githubusercontent.com/jhuang2003/Seattle-Weather/main/clean_seattle_nyc_weather.csv')


##### $\rightarrow$ Review the contents of the data set.

---
The outputs below show that the first half of the dataset records the daily precipitation levels in New York City (NYC) and the second half records those in Seattle (SEA).


In [88]:
df.shape

(2922, 3)

In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2922 entries, 0 to 2921
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           2922 non-null   object 
 1   city           2922 non-null   object 
 2   precipitation  2922 non-null   float64
dtypes: float64(1), object(2)
memory usage: 68.6+ KB


In [90]:
df.head()

         date city  precipitation
0  2020-01-01  NYC       0.013333
1  2020-01-02  NYC       0.000000
2  2020-01-03  NYC       0.134444
3  2020-01-04  NYC       0.200000
4  2020-01-05  NYC       0.040000


In [91]:
df.tail()

            date city  precipitation
2917  2023-12-27  SEA       0.063333
2918  2023-12-28  SEA       0.240000
2919  2023-12-29  SEA       0.055000
2920  2023-12-30  SEA       0.042500
2921  2023-12-31  SEA       0.050000
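
The observation that the data set is split into an NYC half and a SEA half can be checked directly rather than read off `head()` and `tail()`. The following is a minimal sketch: `summarize_layout` is a hypothetical helper (not part of the original notebook), and `toy` is synthetic data with the same three columns, used only so the example runs without downloading anything.

```python
import pandas as pd


def summarize_layout(frame: pd.DataFrame) -> pd.DataFrame:
    """Return row count, row positions, and date range for each city."""
    # Number the rows by position so the summary does not depend on the index.
    numbered = frame.assign(row=range(len(frame)))
    # ISO-formatted date strings sort chronologically, so min/max work as-is.
    return numbered.groupby("city").agg(
        rows=("row", "count"),
        first_row=("row", "min"),
        last_row=("row", "max"),
        first_date=("date", "min"),
        last_date=("date", "max"),
    )


# Synthetic toy data (NOT the real measurements), same columns as df.
toy = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"],
        "city": ["NYC", "NYC", "SEA", "SEA"],
        "precipitation": [0.0, 0.1, 0.2, 0.0],
    }
)

layout = summarize_layout(toy)
print(layout)
```

Calling `summarize_layout(df)` on the loaded data should, if the halves are as described, report 1461 rows per city, with NYC in rows 0 to 1460 and SEA in rows 1461 to 2921.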


## State your questions

The overall problem is to compare how much it rains in Seattle and New York City. To answer this general problem, you will need to ask specific questions about the data.


##### $\rightarrow$ List your questions about the data that will help you solve the problem.

---



1.   On average, does it rain more in Seattle or in New York City?
2.   In which months does it rain more in Seattle, and in which months does it rain more in New York City?



## Analysis

Perform analyses necessary to answer the questions. You will likely start by trying many things, some of which are useful and some of which are not. Don't be afraid to try different analyses at first. You will edit your notebook to a clean version that retains only the essential components at the end of the project.

In [92]:
alt.Chart(df).mark_line().encode(
    x='date:T',
    y='precipitation:Q',
    color='city:N'
).properties(
    title='Daily Precipitation Levels'
)

In [93]:
# Calculate average precipitation levels by month
df['date'] = pd.to_datetime(df['date'])  # Convert to datetime
df['month'] = df['date'].dt.month_name()  # Extract the month name
avg_precipitation = df.groupby(['month', 'city'])['precipitation'].mean().reset_index()
# Calendar order for the x-axis (the default sort would be alphabetical)
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
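
The monthly averages computed above can also be laid out side by side, which makes the second question (in which months Seattle is wetter) answerable from a table as well as from a chart. This is a sketch only: `monthly_city_gap` and the `SEA_minus_NYC` column are hypothetical names introduced here, and `toy` is synthetic data, not the real measurements.

```python
import pandas as pd

MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def monthly_city_gap(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean precipitation per month and city, plus the SEA minus NYC gap."""
    dates = pd.to_datetime(frame["date"])
    monthly = (
        frame.assign(month=dates.dt.month_name())
        .groupby(["month", "city"])["precipitation"]
        .mean()
        .unstack("city")  # one column per city
    )
    # Put the months in calendar order, keeping only those present in the data.
    present = [m for m in MONTH_ORDER if m in monthly.index]
    monthly = monthly.loc[present]
    # Positive values mean Seattle was wetter on average in that month.
    monthly["SEA_minus_NYC"] = monthly["SEA"] - monthly["NYC"]
    return monthly


# Synthetic toy data (NOT the real measurements), same columns as df.
toy = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-07-01", "2020-07-02"] * 2,
        "city": ["NYC"] * 4 + ["SEA"] * 4,
        "precipitation": [0.1, 0.1, 0.2, 0.2, 0.3, 0.1, 0.0, 0.0],
    }
)

gap = monthly_city_gap(toy)
print(gap)
```

On the real data, `monthly_city_gap(df)` would give one row per month; the sign of `SEA_minus_NYC` then indicates which city had the higher average in that month.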


In [95]:
chart1 = alt.Chart(avg_precipitation).mark_line().encode(
    x=alt.X('month:N', sort=month_order),
    y=alt.Y('precipitation:Q'),
    color='city:N'
).properties(
    title='Average Monthly Precipitation Levels'
)
chart1

In [97]:
df['year'] = df['date'].dt.year
avg_yearly_precipitation = df.groupby(['year', 'city'])['precipitation'].mean().reset_index()

bars = alt.Chart(avg_yearly_precipitation).mark_bar().encode(
    x='city:N',
    y=alt.Y('precipitation:Q'),
    color='city:N'
)
text = bars.mark_text(
    align='center',
    baseline='bottom',
    dy=-10,
    angle=315,
    color='black'
).encode(
    text=alt.Text('precipitation:Q', format='.2f')
)
chart2 = (bars + text).facet(
    column='year:N',
    title = None
).properties(
    title='Yearly Average Precipitation Levels by City'
)

chart2

The graph that results from plotting the daily data as is appears too cluttered and busy to read. Therefore, the following graphs plot the monthly and yearly averages by city instead.
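
Alongside the averaged charts, a single per-city summary can address the first question in numbers. The sketch below reports the mean daily precipitation and, as one possible additional reading of "rains more", the share of days with any precipitation. The helper `overall_comparison`, the `wet_day_share` measure, and the zero threshold are assumptions made for illustration and do not come from the original analysis; `toy` is synthetic data.

```python
import pandas as pd


def overall_comparison(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean daily precipitation and share of wet days for each city."""
    # Assumption: a day counts as "wet" if any precipitation was recorded.
    wet = frame["precipitation"] > 0
    return frame.assign(wet_day=wet).groupby("city").agg(
        mean_precipitation=("precipitation", "mean"),
        wet_day_share=("wet_day", "mean"),  # mean of booleans = fraction True
    )


# Synthetic toy data (NOT the real measurements), same columns as df.
toy = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"] * 2,
        "city": ["NYC"] * 4 + ["SEA"] * 4,
        "precipitation": [0.0, 0.4, 0.0, 0.4, 0.1, 0.1, 0.1, 0.1],
    }
)

summary = overall_comparison(toy)
print(summary)
```

In the toy data the two measures disagree: the made-up NYC rows have the higher mean, while the made-up SEA rows have more wet days. Whether the real data shows the same pattern would need to be checked with `overall_comparison(df)`.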

### Results for communication assignment

This file should clearly produce the graphs, tables, models, etc that appear in the communication assignment.

In [98]:
chart1

In [99]:
chart2

## Conclusion

Provide a brief description of your conclusions.

The yearly averages show that average precipitation in New York City is the same as or higher than in Seattle. The monthly averages, however, show that from October to March the average precipitation in Seattle is much higher than in New York City. Therefore, if Dr. Egan’s family wants to avoid Seattle’s rain and escape New York City’s rain, they should visit between March and October.