# Introduction to pandas

## 1. Basic Data Structures in pandas

In this section I'll introduce the two main data structures in pandas: `DataFrame` and `Series`. I'll also show how you can perform some very basic data exploration and analysis of those structures.

### 1.1 Preliminaries 

If you are new to Python, this first step might seem strange. Python ships with a standard library: a collection of packages, each specialized for certain tasks, so that you don't have to write that code "from scratch" every time. Beyond the standard library, there are many other packages (a.k.a. modules) that are easily downloadable from the internet. This is exactly what pandas is: a Python package for data analysis that is not part of Python's standard library.

Python has a small subset of [built-in functions](https://docs.python.org/3/library/functions.html) that are always available to you. However, if you want to use the functionality from one of those additional modules, whether they are from the standard library or downloaded from the internet, you must `import` it before you use it. As such, the first thing that most people like to do at the top of their code is to import the modules that they plan to use. If you forget one or simply decide later on to use one not already imported, then you can do so at any point in the code.

In this next cell we `import` the modules necessary for this notebook. Some coders like to refer to this as a "preamble," but I'm not sure that is standard. For reasons beyond the scope of this tutorial, it is also common for users to set certain parameters in the preamble. For example, below I set a few parameters that make the figures look a bit nicer. In addition to the imports and figure settings, there is also what we call a [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html): `%matplotlib inline`. This command simply tells the plotting module to produce the figures inline in the notebook.

In [1]:
# Any text on any line that is preceded by a '#' is simply a comment

# Most of the lines below will be part of our standard preamble.

%matplotlib inline

import pandas as pd # this is why we're here... to learn pandas
import numpy as np # fundamental package for scientific computing 
import matplotlib.pyplot as plt # this package makes the plots
import requests # this package assists with downloading data from the internet
import json # this package helps Python work with JSON objects

plt.style.use('ggplot') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['font.family'] = 'sans-serif'

**Note**: You probably noticed the use of the word "as" in those `import` statements above. This simply assigns a shorthand label to the package being imported. For example, when we want to use a function from pandas, we can refer to the module using `pd` instead of `pandas`. 
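
As a small illustration of the aliasing described in the note, the sketch below checks that `pd` is just another name for the pandas module. The three-element `Series` is made-up data, used only to show a call made through the alias.

```python
import pandas as pd   # 'pd' is an alias chosen by convention
import pandas         # the same module under its full name

# Both names refer to the same module object
same_module = pd is pandas

# Call a pandas constructor through the alias (made-up values)
example = pd.Series([1, 2, 3])
print(same_module, len(example))
```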

### 1.2 Load a dataset

We can't really do much without some data to work with, so first let's load a dataset into the workspace. One of the most widely used formats for storing data is a file type called "Comma Separated Values" or CSV. A CSV file is simply a text file containing data. The name itself is kind of a misnomer, since the data in a CSV doesn't have to be separated by a comma. It can be separated by a semicolon, tab, space, or some other symbol, called a delimiter. As long as you know what the delimiter is, pandas will have no problem loading the dataset. By default, pandas will assume the delimiter is a comma.
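
To make the point about delimiters concrete, here is a minimal sketch that reads semicolon-delimited text with the `sep` parameter of `read_csv()`. The text in `raw_text` is a made-up stand-in for a file on disk, and `io.StringIO` is used only so that the example runs without any files.

```python
import io
import pandas as pd

# Made-up, semicolon-delimited text standing in for a file on disk
raw_text = "name;mass\nAachen;21\nAarhus;720\n"

# sep tells read_csv which delimiter to split on (the default is a comma)
semi = pd.read_csv(io.StringIO(raw_text), sep=";")
print(semi)

# With the default comma delimiter, each line is read as a single column
wrong = pd.read_csv(io.StringIO(raw_text))
print(wrong.shape)
```

If a loaded dataset unexpectedly has only one column, a mismatched delimiter is a likely cause.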

The dataset I want to open is `../data/Meteorite_Landings.csv`. Just in case you're not familiar with that syntax, `..` means to go up (back) one level in the directory tree. In that directory (folder) you'll see a directory called `data`. If you navigate into the `data` directory, you'll find the CSV file mentioned. The function I'll use to load (*read*) the CSV file is called `read_csv()`.

In [2]:
meteorites = pd.read_csv('../data/Meteorite_Landings.csv')

And just like that, we have "loaded" all of the data from that CSV file into our workspace. Python doesn't print things to the screen unless you ask it to, and an assignment like this one produces no output; that's why you don't see anything. But rest assured that the data contained in that CSV file is now stored in the variable `meteorites`. Unless we modify it or delete it, we will have access to all that data while we are working in this notebook. Now let's learn how the data are arranged in this Python object and the specific data structures associated with it.

### 1.3 The `DataFrame` and `Series` structures

The first thing we should do is to simply look at the data. If you want to do so, you can open up the CSV file itself in a text editor. But the best way to view the data is right here in this notebook. Let's take a look at the top (a.k.a., head) of the dataset. We'll use the `.head()` method to do this. We can put a positive integer inside the () to tell Python how many rows (sometimes called "records") from the head of the dataset to print out. We don't have to do this; by default, it will print out five records.
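
The optional row count described above can be tried on a tiny table first. In this sketch, `toy` and its column names are hypothetical and stand in for any `DataFrame`.

```python
import pandas as pd

# Small made-up table with eight records
toy = pd.DataFrame({
    "record": list(range(8)),
    "value": [v * 10 for v in range(8)],
})

first_five = toy.head()   # no argument: the first five records
first_two = toy.head(2)   # explicit count: the first two records
print(first_two)
```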

In [3]:
meteorites.head()

|   | name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aachen | 1 | Valid | L5 | 21.0 | Fell | 01/01/1880 12:00:00 AM | 50.775 | 6.08333 | (50.775, 6.08333) |
| 1 | Aarhus | 2 | Valid | H6 | 720.0 | Fell | 01/01/1951 12:00:00 AM | 56.18333 | 10.23333 | (56.18333, 10.23333) |
| 2 | Abee | 6 | Valid | EH4 | 107000.0 | Fell | 01/01/1952 12:00:00 AM | 54.21667 | -113.0 | (54.21667, -113.0) |
| 3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 01/01/1976 12:00:00 AM | 16.88333 | -99.9 | (16.88333, -99.9) |
| 4 | Achiras | 370 | Valid | L6 | 780.0 | Fell | 01/01/1902 12:00:00 AM | -33.16667 | -64.95 | (-33.16667, -64.95) |


In [4]:
# To see what kind of object we have
type(meteorites)

pandas.core.frame.DataFrame

It turns out that this object we created (called `meteorites`) is a pandas `DataFrame`. You can think of a `DataFrame` as a table of data, like something you'd create in a spreadsheet. It has two dimensions. Each row is one record of data. Each record consists of ten "pieces" (sometimes called *features*) of data. If we compare this with the text in the actual CSV file, we can see that the first row of the file contains the labels for the various columns (features) of the dataset.

Notice that the first column has no label and is a list of integers from 0 to 4. This is the `index` of the `DataFrame`. These values are not actually in the CSV file... go take a look. Each record in a `DataFrame` needs an identifier, and the `index` provides it. pandas does not strictly enforce uniqueness, but unique labels keep lookups unambiguous. **Note**: an `index` doesn't have to be sequential integers. It can be whatever we want it to be, but integers, dates, and timestamps are very common.
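
As a sketch of an `index` that is not made of sequential integers, the example below labels three records by date. The `readings` table and its `temp_c` column are made up for illustration.

```python
import pandas as pd

# Made-up readings labelled by date rather than by 0, 1, 2, ...
dates = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"])
readings = pd.DataFrame({"temp_c": [1.5, 0.0, -2.0]}, index=dates)
print(readings)

# Look up a single value by its index label and column label
one_day = readings.loc[pd.Timestamp("2021-01-02"), "temp_c"]

# Check that no index label is repeated
labels_are_unique = readings.index.is_unique
print(one_day, labels_are_unique)
```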

Also, notice that the first record has an index of 0, not 1. If you are not familiar with computer science, it may surprise you that we like to start counting with the integer *ZERO*, not with *ONE*. Let's not get into the reason why.

**How big is it?** If we want to see the size of the `DataFrame`, we have several options. If we just want to know how many records there are, we can use the `len()` built-in function for "length". If we want to know the shape, i.e., size in all dimensions, we use the `.shape` attribute of the pandas `DataFrame` object.

In [5]:
# This will show us the length of the DataFrame, i.e., the number of records
len(meteorites)

45716

In [6]:
# This will show us the entire shape of the DataFrame
meteorites.shape

(45716, 10)

Each column in the `DataFrame` is a pandas `Series`. A `Series` is a one dimensional collection of data, like a vector of data (if the mathematical concept of "vector" means anything to you).

In [9]:
# grab onto one of those columns and give it a label
r_class = meteorites.recclass

# print this new object to the screen
print(r_class)

# print the type of this object
type(r_class)

0                          L5
1                          H6
2                         EH4
3                 Acapulcoite
4                          L6
                 ...         
45711                 Eucrite
45712    Pallasite, ungrouped
45713                      H4
45714                      L6
45715                    L3.7
Name: recclass, Length: 45716, dtype: object


pandas.core.series.Series

Notice that it did not print out all 45716 entries in this `Series`. By default, pandas abbreviates the display of large objects, showing only the first and last few entries, unless you change its display settings. Even then, there are limits to the amount of data that this notebook will print to the screen.
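
One way to change that display behavior is the `display.max_rows` option. The sketch below, which uses a made-up `Series` of 100 integers, lifts the limit temporarily with `pd.option_context` and compares how many lines of text each version produces. It assumes the default display settings are in effect.

```python
import pandas as pd

long_series = pd.Series(range(100))

# Default display settings: long output is abbreviated
short_text = repr(long_series)

# Temporarily lift the row limit; None means "no limit"
with pd.option_context("display.max_rows", None):
    full_text = repr(long_series)

# Compare the number of text lines in each version
print(len(short_text.splitlines()), len(full_text.splitlines()))
```

Using `option_context` keeps the change local to the `with` block, so the rest of the notebook still gets the abbreviated display.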

Also notice that the column label "recclass" is no longer displayed as a header above the values; instead, it appears at the bottom as the `Name` of the `Series`. A pandas `Series` behaves much like a one-dimensional array of values. We can even "grab on to" one or more of the values by integer reference, just like we can with an array. (This works here because the index labels are themselves the integers 0, 1, 2, and so on.)

In [11]:
# print out the first element
r_class[0]

'L5'

In [12]:
# print out the first 100 elements
r_class[0:100]

0              L5
1              H6
2             EH4
3     Acapulcoite
4              L6
         ...     
95            LL6
96             L6
97             H6
98             H6
99            LL6
Name: recclass, Length: 100, dtype: object
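
In the cells above, `r_class[0]` works because the index labels happen to be the integers 0, 1, 2, and so on. When the labels are something else, pandas offers explicit accessors: `.iloc` selects by position and `.loc` selects by label. The sketch below uses a made-up three-element `Series` with string labels to show both, along with the `name` attribute.

```python
import pandas as pd

# Made-up column with string index labels instead of 0, 1, 2, ...
classes = pd.Series(["L5", "H6", "EH4"], index=["a", "b", "c"], name="recclass")

print(classes.name)             # the label the Series carries with it

first = classes.iloc[0]         # by position: the first element
by_label = classes.loc["b"]     # by index label
first_two = classes.iloc[0:2]   # positional slice; the end is excluded
print(first, by_label, list(first_two))
```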

So, this `DataFrame` of ours is essentially ten `Series` placed side by side, all sharing the same index. If needed, we can retrieve the column labels from our `DataFrame`.

In [14]:
# Get the column labels
cols = meteorites.columns
print(cols)

Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')


In [15]:
# print out the first column label
cols[0]

'name'

In [16]:
# print out the first three column labels
cols[0:3]

Index(['name', 'id', 'nametype'], dtype='object')
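
The `.columns` attribute returns a pandas `Index` object rather than a plain Python list. If a list is what you need, `list()` converts it. The sketch below uses a made-up two-row table that borrows a few column labels from the dataset. It also shows bracket notation for selecting a column, which works for labels such as `mass (g)` that cannot be written as an attribute.

```python
import pandas as pd

# Made-up two-row table reusing a few column labels from the dataset
toy = pd.DataFrame({
    "name": ["Aachen", "Aarhus"],
    "id": [1, 2],
    "mass (g)": [21.0, 720.0],
})

labels = list(toy.columns)   # plain Python list of the column labels
print(labels)

# Bracket notation selects a column by its label
mass = toy["mass (g)"]
print(type(mass).__name__)
```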

In addition, we can also get the indices of any `DataFrame` and store them in a variable. In this case, doing so would hardly seem necessary since the indices are sequential integers. Nevertheless, if the indices were of some other data type, you might want or need that `index` stored in some variable to be used later.

In [17]:
# Get the indices
idx = meteorites.index
print(idx)

RangeIndex(start=0, stop=45716, step=1)


In [18]:
# print out the 43rd index (remember... we start counting at zero!)
idx[42]

42

In [19]:
# print out the 9th through 18th indices
idx[8:18]

RangeIndex(start=8, stop=18, step=1)