# Loading and Describing Data Using Pandas
## Notebook Outline:

* <a href='#IntroToPandas'>Introduction To Pandas</a>
* <a href='#LoadingStandardCSV'>Loading Standard CSV</a>
* <a href='#BasicDataDescription'>Basic Data Description</a>
* <a href='#LoadingTabDataFile'>Loading Tab DataFile</a>
* <a href='#LoadingWhiteSpaceFile'>Loading White Space File</a>
* <a href='#WritingOutToCSV'>Writing Data Out To A CSV</a>
* <a href='#LessonSummary'>Lesson Summary</a>

# How to use this Notebook

The best way to use this notebook is to follow along with the lecture and then apply what you learn to your own data files, or (if you do not have any data of your own) to practice using these functions and methods on the provided data. A little practice goes a long way towards understanding and retaining the material! It would be easy to just skim this notebook, but you will learn more by doing!

<a name='IntroToPandas'></a>
# Introduction To Pandas - What is Pandas?
### From the Pandas website:
http://pandas.pydata.org/

'pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.'

If you are familiar with Excel, think of Pandas as a similar tool to explore and analyze data. There are big differences between Pandas and Excel (Pandas is faster, can handle larger datasets more efficiently, and can do more overall, but does not have a GUI), but they can be used for similar purposes, and having that comparison in mind may help you digest the information.

### My experience with Pandas:

I use Pandas every day, along with Jupyter Notebook, to explore and analyze client data. It is an integral part of my real-world workflow.

<a name='LoadingStandardCSV'></a>
# Loading A Standard CSV File With Pandas
In the cells below, we import pandas. Then, we load a file of the most popular baby boy names used in Illinois from 1980 to 2013. Please see the comments in each cell for more details about the code. Also, please see the lecture videos that walk through this notebook!

We will be using the `read_csv()` method extensively and introducing some of its arguments. If you'd like, you can refer to its documentation here: <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html>

In [1]:
# First we must import pandas.  It is very common to import pandas as pd.  All
# this means is that I can refer to pandas as 'pd' in my code - saving myself
# from typing 4 more characters and also saving space.

import pandas as pd
import os

In [2]:
# Next, we need to define the filepath to our file.  We will go over this in
# the lecture. Note that I build the path with os.path.join(), which joins the
# pieces with the correct separator for the operating system. I will explain
# this more in the video.


filepath = os.path.join(os.getcwd(), 'data', 'Most_Popular_Baby_Boy_Names__1980-2013.csv')

In [None]:
# Now we can load our data.  It is pretty simple, we just use the read_csv()
# method. Method docs can be found here:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

print(filepath)
nameData = pd.read_csv(filepath)

In [None]:
# Let's print the type of the object we just created
print(type(nameData))
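
If you do not have the data file at hand, the same loading step can be tried on a small in-memory table. This is only a sketch: the column names and values below are hypothetical stand-ins, not the contents of the real baby names file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the baby names file; columns and values are assumed.
csv_text = """Rank,Year,Name,Frequency
1,1980,Michael,3500
2,1980,Jason,2700
3,1980,Matthew,2500
"""

# read_csv() accepts a path or any file-like object, so StringIO works here.
sample_names = pd.read_csv(io.StringIO(csv_text))

print(type(sample_names))
print(sample_names.shape)
```

The result is a DataFrame, the same type of object that `nameData` is above.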



<a name='BasicDataDescription'></a>
# Basic Data Description

Now we start our basic data description!  Unfortunately, "basic" can sometimes sound like something is not interesting, or not "the good stuff" - this is definitely not true in this case. It is relatively simple, but it is very important to have a solid high-level understanding of your data before you dive in deeper. If you skip this, you will end up paying for it later.

#### The .head() method can be used to get the first n lines of a dataframe. It is always a good idea to just 'look' at your data.

In [None]:
# Below we print the first 3 lines of the data file. The default number of
# lines printed is 5
nameData.head(3)

#### The .tail() method can be used to get the last n lines of a dataframe.

In [None]:
nameData.tail()

We just learned something by looking at the data - it looks like the names were entered in all caps in some of the data. This will be important later.

#### The .sample() method can be used to get a random sample of n rows from the dataframe.

In [None]:
nameData.sample(10)
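
The three "look at your data" methods above can be compared side by side on a tiny frame. This is a minimal sketch with made-up data; the `random_state` argument is optional and is used here only so the random sample is repeatable.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the name data.
demo = pd.DataFrame({'Year': list(range(1980, 1990)),
                     'Rank': [1] * 10})

first_three = demo.head(3)                     # first 3 rows
last_two = demo.tail(2)                        # last 2 rows
random_four = demo.sample(4, random_state=0)   # 4 random rows, repeatable

print(first_three)
print(last_two)
print(random_four)
```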

#### The .shape _attribute_ will tell us the size of the file; the number of rows and the number of columns.

In [None]:
nameData.shape

#### The .columns attribute will tell us the names of the columns.

In [None]:
nameData.columns

#### The .dtypes attribute will display the variable type of each column.
* This can be helpful in determining what the contents of each column are (see the auto mpg example below).
* 'object' is used for strings or other variable types that are not numbers or dates.  For example, lists or tuples can be stored in a dataframe, but that is rare - most of the time, when you see 'object' it means the column contains strings.


In [None]:
nameData.dtypes
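
As a small illustration of .shape, .columns and .dtypes together, here is a sketch on a hypothetical three-row frame. How the string column is labeled can depend on your pandas version (older releases report 'object'), so the check below only asks whether each column is numeric.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration.
demo = pd.DataFrame({'Name': ['Michael', 'Jason', 'Matthew'],
                     'Year': [1980, 1980, 1980],
                     'Share': [0.031, 0.024, 0.022]})

n_rows, n_cols = demo.shape          # shape is a (rows, columns) tuple
column_names = list(demo.columns)    # columns is an Index; list() makes it a plain list

print(n_rows, n_cols)
print(column_names)
print(demo.dtypes)

# Ask pandas whether each column holds numbers.
numeric_flags = {col: pd.api.types.is_numeric_dtype(demo[col]) for col in demo.columns}
print(numeric_flags)
```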

#### The .info() method will also tell us the datatype, but with some additional info about the size of the dataframe and the number of non-null entries. 
A null entry would be one that is _empty_ in the dataset.  Remember that sometimes the dataset already comes with null or missing values marked with a special value, like -9999 (we will see this in the weather data example). Pandas will not immediately recognize this as a null value.

In [None]:
nameData.info()
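
The point above about special values such as -9999 can be checked directly. The sketch below uses a hypothetical two-column table: pandas counts no nulls in it, even though two entries are clearly placeholders.

```python
import io
import pandas as pd

# Hypothetical observations; -9999 plays the role of a "missing" marker.
text = """Hour,Precip
0,0
1,-9999
2,5
3,-9999
"""
obs = pd.read_csv(io.StringIO(text))

null_count = obs['Precip'].isna().sum()            # nulls that pandas recognizes
sentinel_count = (obs['Precip'] == -9999).sum()    # placeholder values we spot ourselves

print(null_count, sentinel_count)
```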

#### The .memory_usage() method gives the size of each column in bytes.
Note that if you add these together and divide by 1024 (1024 bytes = 1 KB), you get the same number that is shown in the output from .info()

In [None]:
nameData.memory_usage()
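
The arithmetic described above (sum the per-column bytes, then divide by 1024) might look like the following sketch. The frame is hypothetical, and the dtypes are set explicitly so the per-column sizes are predictable: 3 rows of 8 bytes each.

```python
import pandas as pd

# Hypothetical frame with explicit 8-byte dtypes.
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
demo = demo.astype({'a': 'int64', 'b': 'float64'})

usage = demo.memory_usage()   # Series: one entry for the index plus one per column
total_bytes = usage.sum()
total_kb = total_bytes / 1024

print(usage)
print(total_kb)
```

Note that the index is included in the total, which is why the sum is larger than the columns alone.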

#### The .describe() method outputs basic descriptive statistics about all of the _numerical_ columns in the dataframe.

In [None]:
nameData.describe()
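
To see that .describe() skips non-numerical columns by default, here is a short sketch on a hypothetical frame with one string column and one numerical column.

```python
import pandas as pd

# Hypothetical data: 'Name' holds strings, 'Frequency' holds numbers.
demo = pd.DataFrame({'Name': ['Michael', 'Jason', 'Matthew', 'Joshua'],
                     'Frequency': [10, 20, 30, 40]})

summary = demo.describe()   # only the numerical column is summarized
print(summary)
```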

#### The .unique() method will output the unique values in a column.
In order to get a column from a dataframe, simply put the column name in square brackets after the dataframe variable. For example, we use nameData['Name'] below to get the name column of the dataframe. (We will cover indexing and slicing of dataframes in greater detail in a following lesson.)

In [None]:
nameData['Name'].unique()

#### The .nunique() method will output the number of unique values in a column.

In [None]:
nameData['Name'].nunique()

#### The .value_counts() method will output the number of times each value occurs in a column. 
For example, we see that "Christopher" has been ranked 26 times, and "CHRISTOPHER" 5 times.  So, in actuality, the name has been ranked 31 times. We will come back to this in a future lecture.

In [None]:
nameData['Name'].value_counts()
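
The effect of mixed capitalization on .unique(), .nunique() and .value_counts() can be reproduced on a handful of made-up values. This sketch only illustrates that pandas treats differently cased strings as different values.

```python
import pandas as pd

# Hypothetical names; two spellings differ only by capitalization.
names = pd.Series(['Christopher', 'CHRISTOPHER', 'Michael', 'Christopher'])

distinct = names.unique()        # array of the distinct values
n_distinct = names.nunique()     # how many distinct values there are
counts = names.value_counts()    # occurrences of each distinct value

print(distinct)
print(n_distinct)
print(counts)

# The "real" count for the name is the sum over both spellings.
combined = counts['Christopher'] + counts['CHRISTOPHER']
print(combined)
```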

# Brief summary of what you have learned so far:
* head(n) - get the first n rows
* tail(n) - get the last n rows
* sample(n) - get a random sample of n rows
* shape - get the number of rows and columns
* columns - get the column names
* dtypes - get the variable types of each column
* info() - get the variable types, non-null counts, and memory size of the DataFrame
* memory_usage() - get the memory usage of each column of the data frame
* describe() - get basic summary statistics about each numerical column
* unique() - get the unique values in a column
* nunique() - get the number of unique values in a column
* value_counts() - get the occurrence counts for each value in a column

<a name='LoadingTabDataFile'></a>
# Loading A Data File With Tab Separated Fields
Great work so far! You have learned how to use pandas to get a high-level description of your dataset.  We are now going to apply these same functions to another dataset, and also learn some new functionality (that I use often) in the process.

The next dataset we are going to load is a dataset of car models made from 1970 to 1982. The dataset includes the following attributes of each model: The mpg, number of cylinders, engine displacement, horsepower, weight, acceleration (m/s^2), model year and car name.

#### Introducing the 'sep' argument in the read_csv() method.
The sep argument allows us to specify the field separator that pandas should use when attempting to read in the data. Below, we set it to the tab escape sequence which is '\t'. (This just means that '\t' indicates a tab). Note that the default value for the 'sep' argument is ',' which is why we do not have to set it when reading in comma separated data.

In [None]:
filepath = os.path.join(os.getcwd(), 'data', 'auto-mpg-tabs.csv')

autoMPGData = pd.read_csv(filepath, sep='\t')
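
To see what 'sep' changes, the sketch below reads the same tab-separated text twice: once with sep='\t' and once with the default comma separator. The text is a hypothetical stand-in for the auto mpg file.

```python
import io
import pandas as pd

# Hypothetical tab-separated text (each \t is a tab character).
tsv_text = "mpg\tcylinders\tcar name\n18.0\t8\tcar a\n15.0\t8\tcar b\n"

cars = pd.read_csv(io.StringIO(tsv_text), sep='\t')   # fields split on tabs
wrong = pd.read_csv(io.StringIO(tsv_text))            # default sep=',' finds no commas

print(cars.shape)
print(wrong.shape)
```

With the default separator, each line is read as a single field, so the frame ends up with one column.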

We now use the .head() method to look at our data.

In [None]:
autoMPGData.head()

#### Introducing the index_col argument to the read_csv() method.

Notice the first column 'Unnamed: 0'. We see this in the dataframe because the file already has an index column (see the screenshot below).  Pandas always automatically adds its own index, so it treats the index column in the file as a column of data. Since this column has no header in the file, pandas gives it a generic heading of 'Unnamed: 0'. We can use the 'index_col' argument when reading in a csv to indicate which column, already present in the datafile, we would like to use as the index.  In this case, we want to use the first column. Remember that Python is zero-indexed, so the first column will be column 0.

In [None]:
# Note how we use the index_col argument to read in the first column, in the data file, as the index.
autoMPGData = pd.read_csv(filepath, sep='\t', index_col=0)

Notice that we no longer have the extra 'Unnamed: 0' column when we use .head() below to get the first few lines.

In [None]:
autoMPGData.head()
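
The 'Unnamed: 0' behavior is easy to reproduce with a few lines of text whose first column has no header. The data here is hypothetical; only the blank header in the first position matters.

```python
import io
import pandas as pd

# Hypothetical file contents: the first column is an index with no header.
text = ",mpg,cylinders\n0,18.0,8\n1,15.0,8\n"

with_extra = pd.read_csv(io.StringIO(text))               # index column read as data
with_index = pd.read_csv(io.StringIO(text), index_col=0)  # first column used as the index

print(list(with_extra.columns))
print(list(with_index.columns))
```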

#### Using .shape, .info() and .describe() to better understand the data set.
Notice below how the horsepower data type is 'object' and not 'int64' or 'float64'.  Horsepower is a number, so we would expect the datatype to be an int or float.  But pandas has recognized it as 'object' (which means that pandas has recognized the column as a column of strings).  This is unexpected, and means that there probably is a string in the data! We will see what it is using some of the other methods we have learned.

In [None]:
# shape
autoMPGData.shape

In [None]:
# info()
autoMPGData.info()

In [None]:
# describe()
autoMPGData.describe()

#### Taking a closer look at the horsepower column using .unique()
This is not the only way to find the bad value. But this is one way, using a method we have learned so far. We will see some other possibilities in coming lectures.

In [None]:
# unique()
autoMPGData['horsepower'].unique()
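
Here is a sketch of how a single stray string keeps a whole column from being read as numbers, and how .unique() helps find it. The placeholder 'unknown' is an assumption made up for this example; it is not necessarily the value in the real file.

```python
import io
import pandas as pd

# Hypothetical horsepower column with one non-numerical entry ('unknown').
text = "horsepower\n130\n165\nunknown\n150\n"
cars = pd.read_csv(io.StringIO(text))

is_numeric = pd.api.types.is_numeric_dtype(cars['horsepower'])
values = cars['horsepower'].unique()

# Keep only the unique values that are not made up entirely of digits.
non_numeric = [v for v in values if not str(v).isdigit()]

print(is_numeric)
print(non_numeric)
```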

In [None]:
# memory_usage()
autoMPGData.memory_usage()

<a name='LoadingWhiteSpaceFile'></a>
# Loading A Data File With Fields Delimited By White Space
We will now look at one more data file. This file is from the isd-lite data that can be found here: <ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite>

These files contain weather observations from weather stations all over the world.  We will look at the 2001 data for the station 724080-13739 which is a station at the Philadelphia International Airport.

This particular data is delimited by white space. White space can mean a number of things: tabs, spaces, new lines.  In this case it just means spaces; see the screen shot below.

#### Introducing the 'delim_whitespace' argument
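We can use a special argument when the fields in a datafile are separated by a variable amount of white space. That is, fields could be separated by different numbers of spaces, or by a mix of tabs and spaces. Note that 'delim_whitespace' is deprecated as of pandas 2.2; on newer versions, passing sep='\s+' to read_csv() has the same effect.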

In [None]:
filepath = os.path.join(os.getcwd(), 'data', 'Philadelphia_Pennsylvania_USA/724080-13739-2001')
weatherData = pd.read_csv(filepath, delim_whitespace=True)
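
A minimal sketch of reading fields separated by varying amounts of white space. It uses the regular-expression separator sep=r'\s+', which the pandas documentation describes as equivalent to delim_whitespace=True. The text and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical text: fields are separated by one or more spaces.
text = "year month   temp\n2001  1   -11\n2001 2    -6\n"

# r'\s+' matches any run of white space (spaces, tabs, ...).
obs = pd.read_csv(io.StringIO(text), sep=r'\s+')

print(obs)
```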

In [None]:
weatherData.head()

#### How to set column names for datafiles without column names.
Notice in the above cell, we see that pandas reads the first line as the column names. But, in this file, there are no names and the first line is data. There are two different strategies to solve this.

#### How to set column names by using the 'names' argument when we read in the data.
If we know what the column names should be, we can pass them to the names argument as a list, and pandas will automatically apply the names to the columns when it reads in the data.

We know what the column names should be, by looking at the data documentation which is here: <ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf>

In [None]:
headers = ['Year', 'Month', 'Day', 'Hour', 'Air Temp', 'Dew Point Temp',
           'Sea Level Pressure',
           'Wind Direction', 'Wind Speed Rate',
           'Sky Condition Total Coverage Code',
           'Liquid Precipitation Depth Dimension - 1Hr Duration',
           'Liquid Precipitation Depth Dimension - Six Hour Duration']
print(headers)
weatherData = pd.read_csv(filepath, delim_whitespace=True,
                          names=headers)

In [None]:
weatherData.head(2)

#### Using the 'header' argument and setting the columns after we read in the file.
Another method is to use the 'header' argument to prevent pandas from reading in (and applying) any column names, and then to set the column names with the columns attribute. See below how we set the header argument to None. By default, pandas behaves as if header were 0, which means that it will try to read the first row as the header of the data file (the column names).  Remember that Python is zero-indexed, so a value of 0 indicates the first row. By setting header to None we are "telling" read_csv() that it should not treat any row as the header when reading the file, and it will just number the columns 0 through 11.

We then set the columns attribute to be the list of column names that we defined above.

In [None]:
weatherData = pd.read_csv(filepath, delim_whitespace=True, header=None)
weatherData.head()

In [None]:
weatherData.columns

In [None]:
weatherData.columns = headers
weatherData.head()

In [None]:
weatherData.columns
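
The two strategies above should lead to the same result. The sketch below checks this on a hypothetical three-column snippet, using sep=r'\s+' for the white space and made-up names (`demo_headers`) so it does not overwrite the notebook's own variables.

```python
import io
import pandas as pd

# Hypothetical headerless text and column names.
text = "2001 01 01\n2001 01 02\n"
demo_headers = ['Year', 'Month', 'Day']

# Strategy 1: supply the names while reading.
named = pd.read_csv(io.StringIO(text), sep=r'\s+', names=demo_headers)

# Strategy 2: read with no header, then assign the names afterwards.
unnamed = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None)
default_cols = list(unnamed.columns)   # pandas numbers the columns 0, 1, 2
unnamed.columns = demo_headers

print(default_cols)
print(named)
print(unnamed)
```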

#### What are the -9999 values?
You have probably noticed the -9999 values in the 'Liquid Precipitation Depth Dimension - Six Hour Duration' column.  Without knowing anything more, we should be very suspicious that this is a special value indicating a missing value.  If we look at the data documentation linked in a previous cell, we will see that -9999 is used as a missing value.  We will come back to missing values in a future lecture, and we will specifically look at this example.  For now, we note it and move on.

#### Using .shape, .dtypes, .info(), and .describe() to take a closer look at the weather data.

In [None]:
weatherData.shape

In [None]:
weatherData.dtypes

In [None]:
weatherData.info()

In [None]:
weatherData.describe()

#### Investigating The Sky Condition Total Coverage Code Using value_counts()

In [None]:
# 0 - No Clouds
# 2 - 2 Oktas
# 4 - 4 Oktas
# 6 - 6 Oktas
# 7 - 7 Oktas
# 8 - 8 Oktas
# 9 - Sky obscured or cloud amount can not be estimated
# -9999 - Missing
weatherData['Sky Condition Total Coverage Code'].value_counts()

In [None]:
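# Arithmetic on a column is applied to every row and returns a new Series;
# weatherData itself is not modified.
weatherData['Year'] + 10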

In [None]:
# Here we divide by the total number of rows and multiply by 100 to get the %
# of each cloud cover type in the data.
(weatherData['Sky Condition Total Coverage Code'].value_counts() / weatherData.shape[0]) * 100
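
As an aside, value_counts() also has a `normalize` argument that returns fractions instead of counts, which gives the same percentages as the manual division. A sketch on a hypothetical series of coverage codes:

```python
import pandas as pd

# Hypothetical coverage codes: 8 appears 4 times, 0 three times, 4 once.
codes = pd.Series([0, 0, 8, 8, 8, 4, 0, 8])

manual_pct = codes.value_counts() / codes.shape[0] * 100   # as in the cell above
builtin_pct = codes.value_counts(normalize=True) * 100     # fractions, then percent

print(manual_pct)
print(builtin_pct)
```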

#### Note that many of the methods and attributes we have used return DataFrames or Series as output. 
A series is like a dataframe except that it is one dimensional.
For example, .value_counts() on a column returns a series, and .describe() on a dataframe returns a dataframe.  This is true for many of the operations we will apply to dataframes. Just something to keep in mind as we continue to learn.
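
A quick way to check what a method returns is to pass its output to type(). The frame below is a hypothetical stand-in for the weather data.

```python
import pandas as pd

# Hypothetical two-column frame.
demo = pd.DataFrame({'code': [0, 8, 8], 'temp': [1.5, 2.5, 3.5]})

column = demo['code']                    # a single column is a Series
counts = demo['code'].value_counts()     # Series of counts
summary = demo.describe()                # DataFrame of statistics
col_summary = demo['temp'].describe()    # describe() on a Series gives a Series

for result in (column, counts, summary, col_summary):
    print(type(result))
```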

<a name='WritingOutToCSV'></a>
# How To Write Data Back Out To A CSV

To write data out to a csv, we can use the `to_csv()` method. The docs are here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

Below, let's use `to_csv()` to write out our weatherData dataframe.

In [None]:
# First, create the path to the file we would like to create
save_path = os.path.join(os.getcwd(), 'data', 'Philadelphia_Pennsylvania_USA/724080-13739-2001_out')

# Then use to_csv
weatherData.to_csv(save_path)

#### Now, let's read it back in and look at the file:

In [None]:
weatherData2 = pd.read_csv(save_path)
weatherData2.head()

#### Notice there is now another column - why?

The reason there is another column is that when we use `to_csv()`, pandas writes out the index to the file. But when we read the file back in, pandas will treat that column as a column of data and not as an index. So, we need to either use the `index_col` argument with `read_csv()` to read in the column as an index (like we saw above), or set the `index` argument to False when we use `to_csv()`.

Let's see this below.

In [None]:
weatherData.to_csv(save_path, index=False)
weatherData2 = pd.read_csv(save_path)
weatherData2.head()
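
The same round trip can be tried without touching the disk by writing to an in-memory buffer. This is a sketch with hypothetical data; io.StringIO stands in for the output file.

```python
import io
import pandas as pd

# Hypothetical two-row frame.
demo = pd.DataFrame({'Year': [2001, 2001], 'Air Temp': [-11, -6]})

# Default: the index is written out and comes back as an extra column.
with_index_buf = io.StringIO()
demo.to_csv(with_index_buf)
with_index_buf.seek(0)                      # rewind before reading
reread_extra = pd.read_csv(with_index_buf)

# index=False: only the data columns are written.
no_index_buf = io.StringIO()
demo.to_csv(no_index_buf, index=False)
no_index_buf.seek(0)
reread_clean = pd.read_csv(no_index_buf)

print(list(reread_extra.columns))
print(list(reread_clean.columns))
```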

<a name='LessonSummary'></a>
# Lesson Summary:
In this lesson you learned about the following:
* Methods and attributes that help describe data files:
    * head(n) - get the first n rows
    * tail(n) - get the last n rows
    * sample(n) - get a random sample of n rows
    * shape - get the number of rows and columns
    * columns - get the column names
    * dtypes - get the variable types of each column
    * info() - get the variable types, non-null counts, and memory size of the dataframe
    * memory_usage() - get the memory usage of each column of the data frame
    * describe() - get basic summary statistics about each numerical column
    * unique() - get the unique values in a column
    * nunique() - get the number of unique values in a column
    * value_counts() - get the occurrence counts for each value in a column
<br>
<br>
* Arguments to the read_csv() method that help you read in various file types:
    * sep - an argument that allows you to specify the field separator used (we saw commas and tabs)
    * index_col - an argument to specify the column used as the index
    * names - an argument to specify column names
    * header - an argument to specify which row to use as the header
    * columns - an attribute that can be set, to change the column names of a dataframe
<br>
<br>
* How to use to_csv() to write out data and how to use the index argument.


## In Class Exercise
In the cells below, load the file "AAA_Fuel_Prices.csv" and use some of the methods we learned above to explore it.

In [None]:
filepath = os.path.join(os.getcwd(), 'data', 'AAA_Fuel_Prices.csv')

aaaFuel = pd.read_csv(filepath)

aaaFuel.head()

In [None]:
aaaFuel.info()

In [None]:
aaaFuel.describe()

In [None]:
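# A notebook cell only displays the value of its last expression, so we use
# print() to show all three results.
print(aaaFuel['Fuel'].unique())
print(aaaFuel['Fuel'].nunique())
print(aaaFuel['Fuel'].value_counts())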

## Questions or Comments About This Notebook?
Feel free to contact me via my LinkedIn: https://www.linkedin.com/in/william-j-henry <br>
You can also email me at will@henryanalytics.com <br>