<a href="https://colab.research.google.com/github/megan-owen/MAT328-Techniques_in_Data_Science/blob/main/Lab%201%20-%20MAT%20128%20review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1: MAT 128 Review

This lab reviews the most important data manipulation, filtering, and plotting techniques from MAT 128:

* data frames 
* loading CSV files, including date columns
* missing data
* mean, median, and standard deviation of a column
* line plots
* accessing date information
* histograms
* creating new columns from existing ones
* bar charts
* filtering

We explore New York City weather data from 2014 and 2015. This data was curated by [FiveThirtyEight.com](https://fivethirtyeight.com) and used in their article [What 12 Months of Record-Setting Temperatures Looks Like Across the U.S.](https://fivethirtyeight.com/features/what-12-months-of-record-setting-temperatures-looks-like-across-the-u-s/).

### Section 1.1: Loading the data

First import the necessary *libraries* (also called *modules* or *packages*):
* [pandas](https://pandas.pydata.org/): data analysis tools
* [matplotlib](https://matplotlib.org/): plotting
* [seaborn](https://seaborn.pydata.org/): higher-level data visualization built on the Matplotlib library


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# The following line is needed for some versions of Python and Jupyter Notebooks to display the plots in the notebook.
%matplotlib inline

We will now read the data file, which is in *Comma-Separated Values (CSV)* format, into a *DataFrame*. 

Recall that a CSV file is a text file that stores the rows of a table (e.g., an Excel table) by separating data values in different columns with commas. 

A *DataFrame* is how the pandas library (and hence Python) stores a table of data in the computer.
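
To make the link between CSV text and a DataFrame concrete, here is a minimal sketch. The CSV text and the variable name `toy` are made up for illustration; they are not the contents of the weather file.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names,
# and each later line is one row with values separated by commas.
csv_text = """date,actual_mean_temp,actual_precipitation
2014-07-01,81,0.00
2014-07-02,82,0.96
2014-07-03,78,1.78
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names taken from the first line
```

Each comma-separated line becomes one row, and each position within a line becomes one column.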

We can load the CSV file from a URL into the DataFrame `weather`:

In [None]:
weather = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KNYC.csv")
weather

Or, if you have the file in the same directory as this notebook (hard to do on Colab), you can load the CSV file into the DataFrame `weather` as we did in MAT 128:

In [None]:
weather = pd.read_csv("KNYC.csv")
weather

Look at the DataFrame to answer the following questions:
* When does the data begin?  
* When does it end?   
* What kinds of information do we have about the weather?

In the data column names, "actual" refers to the temperature or precipitation (in inches) value recorded on that date, "average" is the historical average value for that day, and "record" is the highest or lowest recorded value for that day.

To only display the first 5 rows of the DataFrame:

In [None]:
weather.head()

### Section 1.2 Dates
While we can see that the `date` column contains dates, it is not explicitly stored in the DataFrame as a date, and thus we can't use date-specific commands on it.

The following code shows the *data type* of each column, which is how the data is stored in Python.

In [None]:
weather.dtypes

The following code tells Python to encode the `date` column as a DateTime object, and then displays the updated DataFrame.

In [None]:
weather["date"] = pd.to_datetime(weather["date"])
weather.head()

Do you notice any difference in the `date` column compared to the original DataFrame?

Display the data types of each column again.

How have the data types changed?
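
If it helps to see the effect in isolation, the following sketch repeats the conversion on a small made-up table (the name `toy` and its values are assumptions for illustration). It prints the type of the `date` column before and after calling `pd.to_datetime`.

```python
import pandas as pd

# Made-up data: dates start out as plain text, just like after read_csv.
toy = pd.DataFrame({
    "date": ["2014-07-01", "2014-07-02", "2014-07-03"],
    "actual_mean_temp": [81, 82, 78],
})

dtype_before = toy["date"].dtype
print(dtype_before)            # a text (string/object) type

toy["date"] = pd.to_datetime(toy["date"])
print(toy["date"].dtype)       # a datetime type

# Date-specific properties such as .dt.month work only after the conversion.
print(toy["date"].dt.month.tolist())
```

The exact type names printed can differ between pandas versions, so compare the before and after lines rather than looking for one specific name.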

### Section 1.3 Missing data and column statistics

It looks like we have a year's worth of weather data from July 1, 2014 to June 30, 2015.  See if there is any missing data.

In [None]:
weather.describe()

The `count` row tells us how many non-missing values are in each column. Since the count is 365 for every column, which matches the number of rows in our DataFrame, we know there is no missing data.

The `describe()` function also displays the five-number summary (minimum data value, 25th percentile, 50th percentile or *median*, 75th percentile, and maximum data value) for each numerical column, as well as the mean and standard deviation.
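
The weather data has no gaps, so here is a hedged sketch of what missing data would look like. It uses a made-up four-row table (`toy`) with one missing temperature. The `isna().sum()` check is an additional way to count missing values and is not part of the lab's own code.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing temperature (NaN).
toy = pd.DataFrame({
    "actual_mean_temp": [81.0, 82.0, np.nan, 78.0],
    "actual_precipitation": [0.0, 0.96, 1.78, 0.0],
})

summary = toy.describe()
print(summary.loc["count"])    # count skips missing values

n_rows = len(toy)
missing_per_column = toy.isna().sum()   # number of missing values per column
print(n_rows)
print(missing_per_column)

# Individual statistics are also available as column methods.
precip_median = toy["actual_precipitation"].median()
precip_std = toy["actual_precipitation"].std()
print(precip_median, precip_std)
```

A column whose count is smaller than the number of rows has missing values.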

What is the standard deviation of the average minimum temperature?

What is the median actual precipitation?

What is the 75th percentile of record maximum temperature?

### Section 1.4 Line Plots

Recall that a line plot shows how some data value changes over time.  The following code creates a line plot of the daily mean temperature.

In [None]:
weather.plot(x = "date", y= "actual_mean_temp")

What trends do you notice in the above plot?  Does this make sense?

Remember, our plots should always have a title and axis labels.  

In [None]:
weather.plot(x = "date", y= "actual_mean_temp",legend = False)
plt.title("Daily mean temperature")
plt.xlabel("Date")
plt.ylabel('Mean temperature (F)')

Make a line plot of the daily precipitation, including axis labels and a title.

What trends do you notice in your plot?  How does this compare with the plot of the mean daily temperature? 

Do you think a line plot is the best way to understand the daily precipitation?

### Section 1.5 Histograms

Let's look at another way to visualize the daily precipitation data: plotting the distribution of the daily precipitation.  Recall that since the precipitation data is quantitative, we do this with a histogram.

In [None]:
weather["actual_precipitation"].hist()

Sometimes it is helpful to have a finer breakdown of histogram bins.  Do this by adding the *parameter* `bins = 20` in between the parentheses of the function `hist()` like this: 
`weather["actual_precipitation"].hist(bins = 20)`

A parameter gives a function additional information.
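
As a rough illustration of what the `bins` parameter changes, this sketch draws the same made-up values (the Series `precip` is an assumption, not the weather data) twice, side by side: once with the default number of bins and once with `bins = 20`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily precipitation values, mostly zeros with a few wet days.
precip = pd.Series([0, 0, 0, 0, 0.1, 0, 0.5, 0, 0, 1.2, 0, 0.05, 0, 2.0, 0])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

precip.hist(ax=axes[0])             # default: 10 bins
axes[0].set_title("Default bins")

precip.hist(ax=axes[1], bins=20)    # finer breakdown: 20 bins
axes[1].set_title("bins = 20")
```

Each bar covers a narrower range of values in the right-hand plot, so more detail is visible near 0.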

Add a title and axis labels to your plot.

What do you notice about the daily precipitation distribution?  Is this an approximately normal distribution?  Why or why not?

### Section 1.6 Creating new columns and bar charts

Maybe it would be helpful to make a bar chart of the number of days with 0 precipitation and the number of days with >0 precipitation.  To do this, we will first make a new column in our DataFrame that contains a 1 if there was precipitation that day, and a 0 if there was not.

Define a function that outputs (returns) 1 if the input value is greater than 0, and 0 if it is not.


In [None]:
def is_positive(x):
  if x > 0:
    return 1
  else:
    return 0

Now we use the `apply()` function to pass the value from each row of the `actual_precipitation` column into our `is_positive()` function, and store the output for that row in a new column called `precipitated`.

In [None]:
weather["precipitated"] = weather["actual_precipitation"].apply(is_positive)

Display the `weather` DataFrame again to see what happened.

In [None]:
weather.head()

Is this what you expected?
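
For a rule this simple, a comparison applied to the whole column gives the same result as `apply()`. The sketch below checks this on a made-up Series (`precip`). The comparison-based version is an alternative shown for reference, not the approach this lab uses.

```python
import pandas as pd

def is_positive(x):
    # Same rule as the function defined above: 1 if x > 0, otherwise 0.
    if x > 0:
        return 1
    else:
        return 0

# Made-up precipitation values.
precip = pd.Series([0.0, 0.96, 0.0, 1.78, 0.0])

via_apply = precip.apply(is_positive)          # calls is_positive once per row
via_comparison = (precip > 0).astype(int)      # True/False converted to 1/0

print(via_apply.tolist())
print(via_comparison.tolist())
```

`apply()` remains the more flexible tool when the rule is too complicated to write as a single comparison.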

Now let's make a bar chart of the `precipitated` column.  Remember we must compute the value counts first when making a bar chart.

In [None]:
precip_counts = weather["precipitated"].value_counts()
precip_counts.plot(kind = "bar")

Were there more days with or without precipitation?  Is this what you would expect?

Add axis labels and a title to the bar chart.
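
One possible pattern for labeling a bar chart is sketched below on a made-up 0/1 Series (`precipitated_toy` is a hypothetical name). Sorting the counts by index and renaming the tick labels are optional choices, not requirements of the lab.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up indicator values: 1 = precipitation that day, 0 = none.
precipitated_toy = pd.Series([0, 1, 0, 0, 1, 0, 0])

# value_counts() counts how often each value occurs;
# sort_index() puts the bar for 0 before the bar for 1.
counts = precipitated_toy.value_counts().sort_index()
print(counts)

ax = counts.plot(kind="bar")
ax.set_xticklabels(["No precipitation", "Precipitation"], rotation=0)
ax.set_xlabel("Type of day")
ax.set_ylabel("Number of days")
ax.set_title("Days with and without precipitation (toy data)")
```

Without `sort_index()`, the bars appear in order of decreasing count, which may or may not match the order of your labels.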

Instead of precipitation, we might be interested in the number of hot (or cold) days.  Create a bar chart showing the number of days with a max temperature greater than 80F and the number of days with a max temperature of 80F or less.  

(Alternatively, choose your own temperature cut-off.)

### Section 1.7 Filtering

What if we want to look at only part of our data, such as the month of February?  Then we need to *filter* our dataset.

Let's plot the distribution of daily minimum temperatures in February.

First we need to filter our data to include only rows from February.  Recall that we can access just the month part of the date by adding `.dt.month` to the `date` column.

In [None]:
feb_filter = weather["date"].dt.month == 2
weather[feb_filter]

Store this new DataFrame with only the February rows in a new variable, and then create a histogram of the distribution of the minimum daily temperatures in February 2015.
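
To see what a filter is, the sketch below builds a made-up table of dates from January through March 2015 (the name `toy` and the temperature values are placeholders) and inspects the filter before using it.

```python
import pandas as pd

# Made-up table: one row per day from January 1 to March 31, 2015.
dates = pd.date_range("2015-01-01", "2015-03-31", freq="D")
toy = pd.DataFrame({"date": dates, "actual_min_temp": range(len(dates))})

# The filter is a column of True/False values, one per row.
feb_filter = toy["date"].dt.month == 2
print(feb_filter.sum())    # True counts as 1, so this is the number of February rows

# Indexing with the filter keeps only the rows where it is True.
feb = toy[feb_filter]
print(len(feb))
print(feb["date"].min(), feb["date"].max())
```

The original table `toy` is unchanged; the filtered rows live in the new variable `feb`.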

See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components for the different properties that can be used after `.dt`.  For example, to get the mean of the historical average precipitation for all Tuesdays (pandas numbers the days of the week starting from Monday = 0, so Tuesday is 1):

In [None]:
tues_filter = weather["date"].dt.dayofweek == 1
tues_weather = weather[tues_filter]
tues_weather["average_precipitation"].mean()
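
A quick way to check how a `.dt` property is numbered is to try it on a few dates you already know. The three dates below are chosen only for illustration.

```python
import pandas as pd

# Three known dates: a Monday, a Tuesday, and a Sunday in the second quarter.
toy = pd.DataFrame({"date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-04-12"])})

toy["dayofweek"] = toy["date"].dt.dayofweek    # Monday = 0, ..., Sunday = 6
toy["day_name"] = toy["date"].dt.day_name()    # day of the week as text
toy["quarter"] = toy["date"].dt.quarter        # 1 = Jan-Mar, 2 = Apr-Jun, ...

print(toy)
```

The same kind of comparison used for months and weekdays (`== 2`, `== 1`) also works with these columns.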

Can you compute the standard deviation (`.std()`) of the record maximum temperature in the first quarter?

#### Optional challenge questions:
* Create a bar chart of the number of record maximum temperatures that occurred in the year 2000 or later and the number that occurred before 2000
* There is an English proverb that "March comes in like a lion and out like a lamb".  Is the mean temperature (or precipitation) at the beginning of March significantly lower (or higher) than at the end of March?
* What would you like to know about the NYC weather in this data set?  Plot or compute it.