# Case Study

Working with real-world weather and climate data, you will use pandas to manipulate the data into a usable form for analysis. Your task is to compare observed weather data from two sources:

- Climate normals of Austin, TX from 1981-2010,
- Weather data of Austin, TX from 2011.

Source: National Oceanic & Atmospheric Administration, [www.noaa.gov/climate](www.noaa.gov/climate).

In [1]:
import pandas as pd

## Reading and cleaning the data

Inspecting the file with a system tool shows that the data appears to be ASCII-encoded with comma-delimited columns, but it has no header row and therefore no column labels.
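The inspection tool is not named here. As a stand-in, the same checks can be sketched in plain Python: read the raw bytes, test whether they are pure ASCII, and count commas on the first few lines. The sample rows and the temporary file name below are assumptions made for illustration, not the actual NOAA file.

```python
import tempfile
from pathlib import Path

# Hypothetical sample rows, shaped like the first columns of the weather file
sample_text = (
    "13904,20110101,0053,12,OVC045\n"
    "13904,20110101,0153,12,OVC049\n"
    "13904,20110101,0253,12,OVC060\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.csv"  # hypothetical file name
    sample_path.write_text(sample_text, encoding="ascii")

    # Read raw bytes so the encoding check does not depend on a decoder
    raw_bytes = sample_path.read_bytes()

# True when every byte is in the 7-bit ASCII range
is_ascii = raw_bytes.isascii()

# A constant comma count per line suggests comma-delimited columns
first_lines = raw_bytes.decode("ascii").splitlines()[:3]
comma_counts = [line.count(",") for line in first_lines]

print(is_ascii)
print(comma_counts)
print(first_lines[0])  # the first line is data, not a row of labels
```

If the first line already looks like data rather than labels, that is a hint that the file has no header row.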

## Reading in a data file
Let's try to read one file. The problem with real data such as this is that the files are almost never formatted in a convenient way.

In this exercise, there are several problems to overcome in reading the file:
- First, there is no header, and thus the columns don't have labels. 
- There is also no obvious index column, since none of the data columns contain a full date or time.
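
To illustrate the second point, here is a minimal sketch of how separate date and time columns could later be combined into a single datetime index. The column names (station, date, time, station_type) are hypothetical labels chosen for this example, and the two rows are a hand-made sample rather than the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample with the date and time in separate columns
sample_rows = "13904,20110101,0053,12\n13904,20110101,0153,12\n"

# Read date and time as strings so the leading zeros of the time survive
small = pd.read_csv(
    io.StringIO(sample_rows),
    header=None,
    names=["station", "date", "time", "station_type"],  # hypothetical labels
    dtype={"date": str, "time": str},
)

# Concatenate the strings (e.g. "201101010053") and parse them in one pass
timestamps = pd.to_datetime(small["date"] + small["time"], format="%Y%m%d%H%M")

# Use the parsed timestamps as the row index
indexed = small.set_index(timestamps)
print(indexed.index[0])
```

Reading the two columns as strings is a deliberate choice in this sketch: if the time were parsed as an integer, 0053 would become 53 and would need zero-padding before parsing.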

Your job is to read the file into a DataFrame using the default arguments. After inspecting it, you will re-read the file specifying that there are no headers supplied.

In [3]:
data_file = 'data/weather_case_study.csv'

In [4]:
# Read in the data file: df
df = pd.read_csv(data_file)

# Print the output of df.head()
df.head()

Unnamed: 0.1,Unnamed: 0,13904,20110101,0053,12,OVC045,Unnamed: 7,10.00,.1,.2,...,.18,.19,29.95,.20,AA,.21,.22,.23,29.95.1,.24
0,0,13904,20110101,153,12,OVC049,,10.0,,,...,,,30.01,,AA,,,,30.02,
1,1,13904,20110101,253,12,OVC060,,10.0,,,...,30.0,,30.01,,AA,,,,30.02,
2,2,13904,20110101,353,12,OVC065,,10.0,,,...,,,30.03,,AA,,,,30.04,
3,3,13904,20110101,453,12,BKN070,,10.0,,,...,,,30.04,,AA,,,,30.04,
4,4,13904,20110101,553,12,BKN065,,10.0,,,...,15.0,,30.06,,AA,,,,30.06,


In [5]:
# Read in the data file with header=None: df_headers
df_headers = pd.read_csv(data_file, header=None)

# Print the output of df_headers.head()
df_headers.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,35,36,37,38,39,40,41,42,43,44
0,,13904,20110101,53,12,OVC045,,10.0,0.1,0.2,...,0.18,0.19,29.95,0.2,AA,0.21,0.22,0.23,29.95.1,0.24
1,0.0,13904,20110101,153,12,OVC049,,10.0,,,...,,,30.01,,AA,,,,30.02,
2,1.0,13904,20110101,253,12,OVC060,,10.0,,,...,30.0,,30.01,,AA,,,,30.02,
3,2.0,13904,20110101,353,12,OVC065,,10.0,,,...,,,30.03,,AA,,,,30.04,
4,3.0,13904,20110101,453,12,BKN070,,10.0,,,...,,,30.04,,AA,,,,30.04,
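
The difference between the two calls is easier to see on a tiny in-memory sample. The sketch below uses three hand-made rows (an assumption, not the real file) and compares the default behaviour with header=None.

```python
import io
import pandas as pd

# Hypothetical three-row sample without a header line
sample_csv = (
    "13904,20110101,0053,12\n"
    "13904,20110101,0153,12\n"
    "13904,20110101,0253,12\n"
)

# Default: the first data row is consumed as column labels
with_default = pd.read_csv(io.StringIO(sample_csv))

# header=None: every row is data and columns get integer labels
without_header = pd.read_csv(io.StringIO(sample_csv), header=None)

print(with_default.shape)            # (2, 4)
print(without_header.shape)          # (3, 4)
print(list(with_default.columns))    # ['13904', '20110101', '0053', '12']
print(list(without_header.columns))  # [0, 1, 2, 3]
```

With the default arguments one observation is lost to the header. With header=None all rows are kept, and the time value 0053 is parsed as the integer 53, as in the output above.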


## Re-assigning column names
After the initial step of reading in the data, the next step is to clean and tidy it so that it is easier to work with.

In this exercise, you will begin this cleaning process by re-assigning column names and dropping unnecessary columns.

pandas has been imported in the workspace as pd, and the file NOAA_QCLCD_2011_hourly_13904.txt has been parsed and loaded into a DataFrame df. The comma-separated string of column names, column_labels, and the list of columns to drop, list_to_drop, have also been loaded for you.
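
A minimal sketch of these two steps follows. The contents of column_labels and list_to_drop are not shown in this section, so the values below are short hypothetical stand-ins, and the DataFrame is built from a hand-made sample rather than the real file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the parsed weather file
sample_data = "13904,20110101,0053,12,OVC045\n13904,20110101,0153,12,OVC049\n"
df_sample = pd.read_csv(io.StringIO(sample_data), header=None)

# Hypothetical stand-ins for the preloaded column_labels and list_to_drop
column_labels = "Wban,date,Time,StationType,sky_condition"
list_to_drop = ["StationType"]

# Split the comma-separated string into a list of labels
column_labels_list = column_labels.split(",")

# Assign the new labels; the list length must match the number of columns
df_sample.columns = column_labels_list

# Drop the unwanted columns and keep the result in a new DataFrame
df_dropped = df_sample.drop(list_to_drop, axis="columns")
print(df_dropped.head())
```

Assigning to df_sample.columns raises an error if the number of labels does not match the number of columns, which makes it a useful sanity check on the label string.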