# Data science exploration
## Featuring @FremdBot temperature data
* Reminder 1: What is a data scientist?
  * Video: https://www.youtube.com/watch?v=i2jwZcWicSY
  

* Reminder 2: What is Pandas?
  * Pandas is a popular tool that data scientists use to analyze data sets
  * It is a library designed to make it easier to work with large datasets
  * Pandas offers users two new data structures:
    * Series (a single labeled column of values)
    * DataFrames (a table made up of rows and columns)
    
    
* Background for today's data exploration:
  * @FremdBot is a Twitter account that tweets out temperature data from Fremd https://twitter.com/fremdbot
  * Made from two Raspberry Pi computers, two temperature probes, and a camera
  * The data collected by @FremdBot is publicly available here:
      * https://thingspeak.com/channels/123549
      * https://thingspeak.com/channels/142171
      * This is an example of "open data" (data that is freely shared with the world)
      * Further reading: https://en.wikipedia.org/wiki/Open_data
  * Related terms:
    * The Internet of Things (https://en.wikipedia.org/wiki/Internet_of_things)
    * Citizen Science (https://en.wikipedia.org/wiki/Citizen_science)

## The @FremdBot project has generated a great deal of data
* We will use Pandas and some of the tools of data science to dive into this data
* The data structure we will be using is a DataFrame:

In [None]:
# Use pandas to read in the temperature data


# Reads in a csv file (comma separated values)


# Show the last 3 entries in the file:
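
A minimal sketch of these steps, assuming a CSV with `date_time` and `temp` columns (the column names used later in this notebook). The rows below are made-up sample values held in memory, not real @FremdBot readings, and `sample_csv` is a hypothetical name. With the real export you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real CSV export (assumption)
sample_csv = io.StringIO(
    "date_time,temp\n"
    "2017-01-05 17:00:00 UTC,50.0\n"
    "2017-01-05 18:00:00 UTC,48.2\n"
    "2017-01-05 19:00:00 UTC,185.0\n"
    "2017-01-05 20:00:00 UTC,45.5\n"
    "2017-01-05 21:00:00 UTC,44.1\n"
)

# read_csv accepts a file path or a file-like object
temp_data = pd.read_csv(sample_csv)

# tail(3) returns the last 3 rows of the DataFrame
print(temp_data.tail(3))
```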


## Clean the data (remove erroneous temperatures)
* Cleaning data is the process of finding and deleting/fixing erroneous data
  * Further reading: https://en.wikipedia.org/wiki/Data_cleansing
  * In this example, the Raspberry Pi sometimes records an erroneous reading of 185 degrees F 


In [None]:
# Print length


# Trim bad data


# Print new length
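
One possible way to trim the bad readings is a boolean filter, sketched here on made-up values. The cutoff of 120 is an assumption borrowed from the indoor example later in this notebook, and `clean_data` is a hypothetical variable name.

```python
import pandas as pd

# Made-up readings; 185.0 mimics the erroneous sensor value described above
temp_data = pd.DataFrame({
    "date_time": [
        "2017-01-05 17:00:00 UTC",
        "2017-01-05 18:00:00 UTC",
        "2017-01-05 19:00:00 UTC",
        "2017-01-05 20:00:00 UTC",
    ],
    "temp": [50.0, 185.0, 48.2, 185.0],
})

# Number of rows before cleaning
print(len(temp_data))

# Keep only the rows whose temperature is below the assumed cutoff
clean_data = temp_data[temp_data["temp"] < 120]

# Number of rows after cleaning
print(len(clean_data))
```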


## Look at all the temperatures (and sort from low to high)

In [None]:
# Sort the pandas dataframe by temperature


# Print the first 10 items
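
A short sketch of sorting by temperature, using made-up values. `sorted_data` is a hypothetical name, and `sort_values` sorts in ascending order unless told otherwise.

```python
import pandas as pd

# Made-up readings (assumption), same column names as the notebook
temp_data = pd.DataFrame({
    "date_time": [
        "2017-01-05 17:00:00 UTC",
        "2017-01-05 18:00:00 UTC",
        "2017-01-05 19:00:00 UTC",
        "2017-01-05 20:00:00 UTC",
    ],
    "temp": [50.0, 31.5, 72.3, 44.1],
})

# Sort rows from the lowest temperature to the highest
sorted_data = temp_data.sort_values("temp")

# head(10) returns up to the first 10 rows
print(sorted_data.head(10))
```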


### Now create a single list (pandas series) of only temperatures 

In [None]:
# Create variable, store "temp" values


# Print first ten items
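
Selecting one column of a DataFrame by name returns a pandas Series. A minimal sketch on made-up values, with `temps` as a hypothetical variable name:

```python
import pandas as pd

# Made-up readings (assumption)
temp_data = pd.DataFrame({
    "date_time": ["2017-01-05 17:00:00 UTC", "2017-01-05 18:00:00 UTC"],
    "temp": [50.0, 48.2],
})

# Selecting a single column gives a Series rather than a DataFrame
temps = temp_data["temp"]

print(type(temps))
print(temps.head(10))
```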


## Find the average temperature
* The average annual temperature in the Chicagoland area is usually between 49.5F and 52.5F, so this number seems reasonable:

In [None]:
# Import numpy


# Create list/array of temperatures


# Calculate and print the average
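
The averaging step could look roughly like this. The values are made up, so the printed average says nothing about the real data, and `temp_array` and `average` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up readings (assumption)
temp_data = pd.DataFrame({"temp": [40.0, 50.0, 60.0, 70.0]})

# Convert the "temp" column to a NumPy array
temp_array = np.array(temp_data["temp"])

# np.mean adds up the values and divides by how many there are
average = np.mean(temp_array)
print(average)
```

Calling `temp_data["temp"].mean()` directly on the Series gives the same result without the conversion step.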



### Find all temperatures greater than 50

In [None]:
# Create variable, store only temps above 50


# Print the first ten values


### Find all temperatures greater than 90

In [None]:
# Create variable, store only temps above 90


# Print the first ten results
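
Both filters above follow the same pattern: compare the Series to a threshold, then use the resulting True/False values to pick rows. A sketch on made-up values, with hypothetical variable names:

```python
import pandas as pd

# Made-up temperatures (assumption)
temps = pd.Series([31.5, 50.0, 72.3, 91.4, 95.0])

# temps > 50 is a Series of True/False values; indexing with it keeps the True rows
above_50 = temps[temps > 50]
above_90 = temps[temps > 90]

print(above_50.head(10))
print(above_90.head(10))
```

Note that a reading of exactly 50.0 is not kept, because `>` is a strict comparison.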


## Look at the 10 highest temperatures

In [None]:
# Create variable, store the last 10 temps


# Print the results


### When did those high temperatures occur?

In [None]:
# Create a variable, store only dates


# Print results
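
One way to connect the highest temperatures to their dates is to keep whole rows while sorting, then read the `date_time` column from the result. This sketch uses made-up values and takes the last 3 rows instead of 10 because the sample is tiny. `hottest` and `hottest_dates` are hypothetical names.

```python
import pandas as pd

# Made-up readings (assumption)
temp_data = pd.DataFrame({
    "date_time": [
        "2017-01-05 17:00:00 UTC",
        "2017-01-05 18:00:00 UTC",
        "2017-01-05 19:00:00 UTC",
        "2017-01-05 20:00:00 UTC",
        "2017-01-05 21:00:00 UTC",
    ],
    "temp": [50.0, 31.5, 72.3, 91.4, 44.1],
})

# After an ascending sort, the highest temperatures are the last rows
hottest = temp_data.sort_values("temp").tail(3)

# Each row keeps its date, so the matching dates come along for free
hottest_dates = hottest["date_time"]

print(hottest)
print(hottest_dates)
```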


### Reformat the date and time of each entry 

In [None]:
# Import datetime
from datetime import datetime
new_dates = []
for date in temp_data["date_time"]:
    new_dates.append(datetime.strptime(date,'%Y-%m-%d %H:%M:%S UTC'))

print(new_dates[:10])
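
As an alternative to the loop, pandas can parse a whole column in one call with `pd.to_datetime`. This is a sketch on made-up timestamps that assumes every entry ends in the literal text `UTC`, as the format string in the loop does. `parsed` is a hypothetical name.

```python
import pandas as pd

# Made-up timestamps in the same text format as the notebook's data (assumption)
temp_data = pd.DataFrame({
    "date_time": ["2017-01-05 17:00:00 UTC", "2017-01-06 03:00:00 UTC"],
})

# Parse every string in the column using the same format codes as strptime
parsed = pd.to_datetime(temp_data["date_time"], format="%Y-%m-%d %H:%M:%S UTC")

print(parsed)
```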

### Now to graph the temperature data...

In [None]:
# Import matplotlib
import matplotlib.pyplot as plot
%matplotlib inline

# Graph details
plot.figure(figsize=(25, 10))  # width and height in inches
plot.style.use("ggplot")  # fivethirtyeight, bmh, grayscale, dark_background, ggplot
plot.title("Temperature Data from Fremd Bot")
plot.xlabel("Date")
plot.ylabel("Temperature")

# plot.scatter(x-axis data, y-axis data)
plot.scatter(new_dates,temp_data["temp"])
plot.show()

### Observations:
* Sometimes a good visualization can show us something that we don't notice in the full data set
  * What do you notice about this scatterplot that wasn't obvious in the original dataset?
  * Is this dataset complete?
  * What might have happened with the data?
    * Instructor note: Two main things happened to the data. Sometimes the .py script that recorded temperatures would crash and need to be restarted. Other times the power to our Raspberry Pi device would be cut off :( Both of these scenarios would cause gaps in the data.
  * Should a dataset be disregarded or deleted because it is imperfect?
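
Gaps like these can also be located in code by measuring the time between consecutive readings. The sketch below uses made-up timestamps, and the one-hour threshold is an arbitrary assumption rather than a property of the real data. `times`, `gaps`, and `long_gaps` are hypothetical names.

```python
import pandas as pd

# Made-up timestamps with one long pause in the middle (assumption)
stamps = pd.Series([
    "2017-01-05 17:00:00 UTC",
    "2017-01-05 17:10:00 UTC",
    "2017-01-05 17:20:00 UTC",
    "2017-01-05 23:00:00 UTC",
    "2017-01-05 23:10:00 UTC",
])
times = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S UTC")

# diff() gives the time elapsed since the previous reading
gaps = times.diff()

# Keep only the pauses longer than the assumed one-hour threshold
long_gaps = gaps[gaps > pd.Timedelta(hours=1)]
print(long_gaps)
```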

## Try again with indoor temperatures (Room 223):

In [None]:
# Import pandas
import pandas as pd

# Read in the csv file (comma separated values)
indoor_orig = pd.read_csv('Room 223 2019.csv')

# Clean the data (remove entries over 120F)
indoor_clean = indoor_orig[indoor_orig.temp<120]

# Reformat the dates and times
from datetime import datetime
indoor_dates = []

for date in indoor_clean["date_time"]:
    indoor_dates.append(datetime.strptime(date,'%Y-%m-%d %H:%M:%S UTC'))

# Print first 10 results
print(indoor_dates[:10])

### Plot:

In [None]:
# Import matplotlib
import matplotlib.pyplot as plot
%matplotlib inline

# Graph details
plot.figure(figsize=(25, 10))  # width and height in inches
plot.style.use("ggplot")  # fivethirtyeight, bmh, grayscale, dark_background, ggplot
plot.title("Indoor Temperature Data from Fremd Bot")
plot.xlabel("Date")
plot.ylabel("Temperature")

#plot.scatter(x, y, s=area, c=colors, alpha=0.5)
plot.scatter(indoor_dates,indoor_clean["temp"],color="green")
plot.show()

# Task 1
* What do you notice about this scatterplot compared to the outdoor temperature scatterplot?

# Task 2
* Show the 10 hottest temperatures recorded in Room 223
* What was the date and time of the hottest temperature recorded in Room 223?

# Task 3
* Show the 10 coldest temperatures recorded in Room 223
* What was the date and time of the coldest temperature recorded in Room 223?

# Task 4
* What was the average temperature in Room 223 based on the provided dataset?

# Challenge
* Use the outdoor temperature data to determine the date and magnitude of the largest 24-hour temperature swing
  * For example, suppose that it was 50 degrees F on 1/5/17 at 5:00pm and -10 degrees F on 1/6/17 at 3:00am
  * If this were the case, then this would be a temperature swing of -60 degrees F within a 24-hour window