# Introduction to pandas

*I'd like to thank Nick Ross who graciously allowed me to derive pieces of this notebook and others from his amazing SQL/pandas course notes.*

Pandas has two primary entities that you must be careful to distinguish to avoid getting confused:

* `DataFrame` is a 2D tabular data structure; it has both rows and columns
* `Series` is a 1D array (column) data structure

While a series looks like a column from a data frame, the two are separate kinds of objects with different sets of functions that you can apply to them.
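To make the distinction concrete, here is a minimal sketch using a small hand-built data frame (the toy values and the variable names `df`, `col`, and `sub` are just for illustration). Indexing with a single column name gives back a `Series`, while indexing with a list of column names gives back a `DataFrame`, even if the list has only one name in it:

```python
import pandas as pd

# Hypothetical toy data, not loaded from any file
df = pd.DataFrame({"MPG": [18.0, 15.0, 18.0], "CYL": [8, 8, 8]})

col = df["MPG"]     # single brackets: a Series (1D)
sub = df[["MPG"]]   # list of names: a one-column DataFrame (2D)

print(type(col).__name__, col.shape)
print(type(sub).__name__, sub.shape)
```

The shapes tell the story: the series has one dimension and the data frame has two.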

Pandas lets you do a lot of querying, merging, and aggregation just like a database, but these dataframes only exist in memory. That means:

* you can only operate on data that fits in memory, which is usually far less than what fits on disk
* dataframes you construct disappear when your Python program terminates
* pandas is not suitable for use on problems requiring multiple computers
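Since everything lives in memory, it can be handy to check how much space a data frame occupies. The following is a rough sketch, assuming a made-up frame with one million floating-point values (the names `df` and `nbytes` are illustrative); `memory_usage` reports bytes per column, and `deep=True` asks pandas to also count the contents of string-like columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one million 64-bit floats in a single column
df = pd.DataFrame({"x": np.zeros(1_000_000)})

# Total bytes across the index and all columns
nbytes = df.memory_usage(deep=True).sum()
print(f"{nbytes / 1e6:.1f} MB")
```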

Pandas dataframes will be your primary data structure until we reach machine learning, where we discuss building more complicated data structures such as decision trees.

## How to start using pandas

First, we have to tell Python that we want to use pandas. I also tell it to import numpy, because I often want to use both of these libraries in conjunction with each other:

In [1]:
import numpy as np
import pandas as pd

Let's load some data from the `data` subdirectory under this `notebooks` directory. You can download this file onto your own computer wherever you want, but make sure that you specify the appropriate file path when loading it with pandas.

In [3]:
!ls data  # Anything after ! char is sent to Terminal for execution

cars.csv
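If `ls` isn't available on your system, you can do a similar check from Python itself. This is just a sketch that assumes the file sits at `data/cars.csv` relative to the notebook; the variable name `csv_path` is my own, and you should adjust the path to wherever you saved the file:

```python
from pathlib import Path

# Assumed location; change this if you saved the file elsewhere
csv_path = Path("data") / "cars.csv"

print(Path.cwd())         # directory that relative paths are resolved against
print(csv_path.exists())  # True only if the file is where we think it is
```

If the second line prints `False`, compare the working directory printed on the first line with the place you actually saved the file.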


In a spreadsheet, the start of that file looks like:

<img src="images/excel.png" width="250">

In [3]:
df_cars = pd.read_csv("data/cars.csv")
df_cars.head()

| | MPG | CYL | ENG | WGT |
|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 3504 |
| 1 | 15.0 | 8 | 350.0 | 3693 |
| 2 | 18.0 | 8 | 318.0 | 3436 |
| 3 | 16.0 | 8 | 304.0 | 3433 |
| 4 | 17.0 | 8 | 302.0 | 3449 |
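Beyond `head()`, a few attributes give a quick summary of what was loaded. So that the sketch below runs without the file, it rebuilds a stand-in from only the five rows displayed above (the `csv_text` string is my own construction, and I'm assuming the file's columns are `MPG`, `CYL`, `ENG`, and `WGT`); with the real file you would keep using `pd.read_csv("data/cars.csv")`:

```python
import io
import pandas as pd

# Stand-in for data/cars.csv: just the five rows shown above
csv_text = """MPG,CYL,ENG,WGT
18.0,8,307.0,3504
15.0,8,350.0,3693
18.0,8,318.0,3436
16.0,8,304.0,3433
17.0,8,302.0,3449
"""
df_cars = pd.read_csv(io.StringIO(csv_text))

print(df_cars.shape)          # (number of rows, number of columns)
print(list(df_cars.columns))  # column names
print(df_cars.dtypes)         # type pandas inferred for each column
print(df_cars.head(2))        # head(n) shows only the first n rows
```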


### Exercise

0. Launch `jupyter lab` and create a notebook for the exercises in today's class. I like to create a notebooks directory associated with my topic or class and then launch jupyter from the terminal in that directory so that I know where files are being created.
1. Download the `cars.csv` file and save it either in the same directory as your notebook or in a `data` subdirectory. You have to get used to being very organized and paying attention to the structure of your directories when referring to files on the disk.
2. Import pandas and read in the `cars.csv` file into a data frame called `df_cars`, just to be consistent with your fellow students and the instructor.
3. Print the first few rows of that data frame.