# Case Study (Timeseries in pandas)

Working with real-world weather and climate data, you will use pandas to manipulate the data into a usable form for analysis. Your task is to compare observed weather data from two sources:

- Climate normals of Austin, TX from 1981-2010,
- Weather data of Austin, TX from 2011.

Source: National Oceanic & Atmospheric Administration, [www.noaa.gov/climate](www.noaa.gov/climate).

In [1]:
import pandas as pd

## Reading and cleaning the data

Inspecting the raw file with a system tool shows that the data appears to be ASCII-encoded with comma-delimited columns, but it has no header row, so the columns are unlabeled.
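This kind of inspection can also be done from Python without parsing anything. The following is a minimal sketch: the helper name `peek` and the two-line sample file are hypothetical stand-ins, not part of the original case study.

```python
from itertools import islice
from pathlib import Path
import tempfile


def peek(path, n=5, encoding="ascii"):
    """Return the first n raw lines of a text file without parsing them."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Hypothetical stand-in for the weather file: two comma-delimited rows, no header.
sample = "13904,20110101,0053,12,OVC045\n13904,20110101,0153,12,OVC049\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text(sample, encoding="ascii")
    first_lines = peek(sample_path, n=2)

# Each raw line shows the delimiter and the absence of column labels.
for line in first_lines:
    print(line)
```

Opening the file with `encoding="ascii"` raises a `UnicodeDecodeError` if the file contains non-ASCII bytes, so the same call doubles as a quick check of the encoding assumption.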

## Reading in a data file
Let's try to read one file. The problem with real data such as this is that the files are almost never formatted in a convenient way.

In this exercise, there are several problems to overcome in reading the file:
- First, there is no header, and thus the columns don't have labels. 
- There is also no obvious index column, since none of the data columns contain a full date or time.

Your job is to read the file into a DataFrame using the default arguments. After inspecting it, you will re-read the file specifying that there are no headers supplied.
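The effect of the `header` argument is easy to see on a tiny in-memory example. This is a minimal sketch, not the case-study file: the three-line string `raw` is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row CSV with no header line.
raw = (
    "13904,20110101,0053,12\n"
    "13904,20110101,0153,12\n"
    "13904,20110101,0253,12\n"
)

# Default behaviour: the first line is consumed as column labels.
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is data and columns get integer labels.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)       # one data row fewer than the file has lines
print(no_header.shape)          # all three rows kept
print(list(no_header.columns))  # integer column labels
```

With the defaults, the first observation silently becomes the column labels, which is why the first output in this section shows values such as `13904` and `20110101` in the header row.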

In [3]:
data_file = 'data/weather_case_study.csv'

In [4]:
# Read in the data file: df
df = pd.read_csv(data_file)

# Print the output of df.head()
df.head()

Unnamed: 0.1,Unnamed: 0,13904,20110101,0053,12,OVC045,Unnamed: 7,10.00,.1,.2,...,.18,.19,29.95,.20,AA,.21,.22,.23,29.95.1,.24
0,0,13904,20110101,153,12,OVC049,,10.0,,,...,,,30.01,,AA,,,,30.02,
1,1,13904,20110101,253,12,OVC060,,10.0,,,...,30.0,,30.01,,AA,,,,30.02,
2,2,13904,20110101,353,12,OVC065,,10.0,,,...,,,30.03,,AA,,,,30.04,
3,3,13904,20110101,453,12,BKN070,,10.0,,,...,,,30.04,,AA,,,,30.04,
4,4,13904,20110101,553,12,BKN065,,10.0,,,...,15.0,,30.06,,AA,,,,30.06,


In [10]:
# Re-read the data file with header=None so the first row is kept as data: df
df = pd.read_csv(data_file, header=None)
# Drop column 0, which appears to hold only a row counter
df = df.drop([0], axis=1)

# Print the output of df.head()
df.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,35,36,37,38,39,40,41,42,43,44
0,13904,20110101,53,12,OVC045,,10.0,0.1,0.2,0.3,...,0.18,0.19,29.95,0.2,AA,0.21,0.22,0.23,29.95.1,0.24
1,13904,20110101,153,12,OVC049,,10.0,,,,...,,,30.01,,AA,,,,30.02,
2,13904,20110101,253,12,OVC060,,10.0,,,,...,30.0,,30.01,,AA,,,,30.02,
3,13904,20110101,353,12,OVC065,,10.0,,,,...,,,30.03,,AA,,,,30.04,
4,13904,20110101,453,12,BKN070,,10.0,,,,...,,,30.04,,AA,,,,30.04,


## Re-assigning column names

After the initial step of reading in the data, the next step is to clean and tidy it so that it is easier to work with.
In this exercise, you will begin this cleaning process by re-assigning column names and dropping unnecessary columns.

In [7]:
column_labels = 'Wban,date,Time,StationType,sky_condition,sky_conditionFlag,visibility,visibilityFlag,wx_and_obst_to_vision,wx_and_obst_to_visionFlag,dry_bulb_faren,dry_bulb_farenFlag,dry_bulb_cel,dry_bulb_celFlag,wet_bulb_faren,wet_bulb_farenFlag,wet_bulb_cel,wet_bulb_celFlag,dew_point_faren,dew_point_farenFlag,dew_point_cel,dew_point_celFlag,relative_humidity,relative_humidityFlag,wind_speed,wind_speedFlag,wind_direction,wind_directionFlag,value_for_wind_character,value_for_wind_characterFlag,station_pressure,station_pressureFlag,pressure_tendency,pressure_tendencyFlag,presschange,presschangeFlag,sea_level_pressure,sea_level_pressureFlag,record_type,hourly_precip,hourly_precipFlag,altimeter,altimeterFlag,junk'
list_to_drop = ['sky_conditionFlag', 'visibilityFlag', 'wx_and_obst_to_vision', 'wx_and_obst_to_visionFlag', 'dry_bulb_farenFlag', 'dry_bulb_celFlag', 'wet_bulb_farenFlag', 'wet_bulb_celFlag', 'dew_point_farenFlag', 'dew_point_celFlag', 'relative_humidityFlag', 'wind_speedFlag', 'wind_directionFlag', 'value_for_wind_character', 'value_for_wind_characterFlag', 'station_pressureFlag', 'pressure_tendencyFlag', 'pressure_tendency', 'presschange', 'presschangeFlag', 'sea_level_pressureFlag', 'hourly_precip', 'hourly_precipFlag', 'altimeter', 'record_type', 'altimeterFlag', 'junk']

In [11]:
# Split on the comma to create a list: column_labels_list
column_labels_list = column_labels.split(',')

# Assign the new column labels to the DataFrame: df.columns
df.columns = column_labels_list

# Remove the appropriate columns: df
df = df.drop(list_to_drop, axis='columns')

df.head()

Unnamed: 0,Wban,date,Time,StationType,sky_condition,visibility,dry_bulb_faren,dry_bulb_cel,wet_bulb_faren,wet_bulb_cel,dew_point_faren,dew_point_cel,relative_humidity,wind_speed,wind_direction,station_pressure,sea_level_pressure
0,13904,20110101,53,12,OVC045,10.0,51,10.6,38,3.1,15,-9.4,24,15.1,360,29.42,29.95
1,13904,20110101,153,12,OVC049,10.0,51,10.6,37,3.0,14,-10.0,23,10.0,340,29.49,30.01
2,13904,20110101,253,12,OVC060,10.0,51,10.6,37,2.9,13,-10.6,22,15.0,10,29.49,30.01
3,13904,20110101,353,12,OVC065,10.0,50,10.0,38,3.1,17,-8.3,27,7.0,350,29.51,30.03
4,13904,20110101,453,12,BKN070,10.0,50,10.0,37,2.8,15,-9.4,25,11.0,20,29.51,30.04


## Cleaning and tidying datetime data

In order to use the full power of pandas time series, you must construct a `DatetimeIndex`. To do so, it is necessary to clean and transform the date and time columns.

Your job is to clean up the `date` and `Time` columns and combine them into a single datetime collection to be used as the `Index`.
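One way to approach this is sketched below on a toy DataFrame. It assumes, as in the output above, that `date` holds integers like `20110101` and that `Time` holds integers like `53` that have lost their leading zeros. The frame `toy` and its values are made up for illustration, and this is only one possible approach.

```python
import pandas as pd

# Hypothetical miniature of the weather DataFrame.
toy = pd.DataFrame(
    {
        "date": [20110101, 20110101, 20110101],
        "Time": [53, 153, 1253],
        "dry_bulb_faren": [51, 51, 58],
    }
)

# Convert both columns to strings; pad Time to four digits (53 -> "0053").
date_string = toy["date"].astype(str)
time_string = toy["Time"].astype(str).str.zfill(4)

# Concatenate to "YYYYMMDDHHMM" and parse with an explicit format.
date_times = pd.to_datetime(date_string + time_string, format="%Y%m%d%H%M")

# Use the parsed datetimes as the index, giving a DatetimeIndex.
toy_clean = toy.set_index(date_times)

print(toy_clean.index)
```

Padding `Time` before concatenating matters: without it, `20110101` + `53` would not match the `%H%M` part of the format, and parsing would fail or be ambiguous.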