# Python Data Science Bite-Sized Lesson:<br>Introduction to Pandas and Matplotlib

**Author**: Michelle Franc Ragsac (mragsac@eng.ucsd.edu)

---

**Notebook Information:**

This Jupyter Notebook contains information on the basic functionality of the `pandas` and `matplotlib` packages in Python for Module 4: Introduction to HPC. <br>It's running with a `Python 3` kernel! 

---

## Import Necessary Packages for the Module

Before we start coding, we want to import the `modules` that we'll be using in our notebook. This is the same as importing the modules at the beginning of a Python script. 

Each of the three packages we'll be going through in this series of notebooks has a conventional shorthand that people use in their code. The shorthand for `numpy` is `np`, `pandas` is `pd`, and `matplotlib.pyplot` is `plt`. If you find any code online (e.g., through StackOverflow) and you see these terms, these are the packages they're usually referring to! 

In [1]:
import numpy as np                 # adds support for large, multi-dimensional arrays and optimized linear algebra
import pandas as pd                # adds support for Excel-like table operations (similar to R data frames)
import matplotlib.pyplot as plt    # adds support for plotting in Python

In addition to importing these packages, there is a special line that we can add to view any plots generated with `matplotlib` within our notebook as part of the output of a code cell! 

This special line is called a "magic function"! 

**Documentation on Magic Functions in Jupyter Notebooks**: https://ipython.readthedocs.io/en/stable/interactive/magics.html

In [2]:
%matplotlib inline

---

## Introduction to Pandas

Pandas is an open-source Python package for data analysis and manipulation, built around `DataFrame` and `Series` objects. A `DataFrame` can be thought of as a two-dimensional table with attached row and column labels, similar to the tables found in Excel spreadsheets or `data.frame` objects in the R programming language. A `Series` can be thought of as a single column of values with attached row labels. Pandas also has tools for reading and writing data between in-memory structures and different commonly-used formats, such as CSV, text files, Microsoft Excel sheets, SQL databases, and the HDF5 format. 

**Pandas Website**: https://pandas.pydata.org/
<br>**Pandas Code Base on GitHub**: https://github.com/pandas-dev/pandas
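
To make the relationship between the two object types a little more concrete, here is a minimal sketch. The names `coffee`, `tea`, and `table`, along with all of the values, are made up for illustration and aren't part of the lesson's data.

```python
import pandas as pd

# A Series is a single column of values with row labels (its index).
# The day labels and counts here are hypothetical.
days = ['Mon', 'Tue', 'Wed']
coffee = pd.Series([2, 1, 3], index=days, name='coffee')
tea = pd.Series([0, 2, 1], index=days, name='tea')

# Look up one value by its row label
print(coffee['Tue'])

# A DataFrame can be assembled from several Series that share an index
table = pd.DataFrame({'coffee': coffee, 'tea': tea})
print(table.shape)  # (number of rows, number of columns)

# Pulling a single column back out of a DataFrame gives a Series again
column = table['coffee']
print(type(column).__name__)
```

In other words, a `DataFrame` behaves a lot like a collection of `Series` objects that share the same row labels.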

---

## Creating a Pandas `DataFrame` from Scratch

To start off, we'll learn how to create a `DataFrame` object from scratch. This is an important skill to learn as it helps you test new methods and functions you might find in the Pandas documentation. One of the easiest ways to create a `DataFrame` from scratch is to use a `dict` object. 

<div class="alert alert-block alert-info">
    <b>Note:</b> Dictionaries (<code>dict</code>) are a hash table structure that is built into Python. They're collections of items where each item is a key-value pair (since Python 3.7, they also remember the order in which items were inserted). Dictionaries are optimized to retrieve values when a key is provided. 
</div>
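
If dictionaries are new to you, here is a small sketch of how key-based lookup works. The `menu_prices` dictionary and its prices are hypothetical and are only here to show the syntax.

```python
# A dictionary maps each key to a value (key : value)
menu_prices = {
    'tacos': 3.50,      # hypothetical price
    'burritos': 8.00    # hypothetical price
}

# Retrieve a value by providing its key
taco_price = menu_prices['tacos']
print(taco_price)

# .get() returns a fallback instead of raising a KeyError for a missing key
fallback = menu_prices.get('nachos', 'not on the menu')
print(fallback)

# Loop over every key-value pair
for item, price in menu_prices.items():
    print(item, price)
```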

In [6]:
# Create a dictionary called data and populate each key with its values (key : value)
data = {
    'tacos': [1, 3, 2],
    'burritos': [1, 1, 2]
}

# Create a list called names containing what we want to label our indices in our dataframe
names = ['Michelle', 'Cameron', 'Owen']

# Pass the dictionary to the DataFrame constructor to create a dataframe called purchases
purchases = pd.DataFrame(data, index=names)
purchases # preview the dataframe in the notebook

| | tacos | burritos |
|---|---|---|
| Michelle | 1 | 1 |
| Cameron | 3 | 1 |
| Owen | 2 | 2 |


Now, we have a `DataFrame` object called `purchases` containing a table of how many `burritos` and `tacos` each person purchased! 
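
After building a small table like this, it can help to double-check its structure. The sketch below rebuilds the same table so that it runs on its own, then looks at a few standard `DataFrame` attributes. This is just one possible way to inspect the table, not a required step in the lesson.

```python
import pandas as pd

# Rebuild the same table so this cell runs on its own
data = {
    'tacos': [1, 3, 2],
    'burritos': [1, 1, 2]
}
names = ['Michelle', 'Cameron', 'Owen']
purchases = pd.DataFrame(data, index=names)

# (number of rows, number of columns)
print(purchases.shape)            # (3, 2)

# Column labels come from the dictionary keys
print(list(purchases.columns))    # ['tacos', 'burritos']

# Row labels come from the list passed to index=
print(list(purchases.index))      # ['Michelle', 'Cameron', 'Owen']

# Selecting one column gives a Series that we can summarize
total_tacos = purchases['tacos'].sum()
print(total_tacos)                # 6
```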

## Selecting Rows in a `DataFrame` Based on its Index Value

Say we wanted to know how many `tacos` and `burritos` Cameron purchased. We can select rows within the `DataFrame` object by their index label using the `.loc` indexer, which takes square brackets rather than parentheses:

In [7]:
purchases.loc['Cameron']

tacos       3
burritos    1
Name: Cameron, dtype: int64

From this command, we can see that Cameron purchased three `tacos` and one `burrito`! 
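
Label-based selection with `loc` also accepts a list of row labels, or a row label together with a column label. Here is a minimal sketch that rebuilds the same table so it runs on its own. The variable names `two_people` and `cameron_tacos` are just for illustration.

```python
import pandas as pd

# Rebuild the same table so this cell runs on its own
purchases = pd.DataFrame(
    {'tacos': [1, 3, 2], 'burritos': [1, 1, 2]},
    index=['Michelle', 'Cameron', 'Owen']
)

# Passing a list of labels selects several rows and returns a DataFrame
two_people = purchases.loc[['Michelle', 'Owen']]
print(two_people)

# Passing a row label and a column label returns a single value
cameron_tacos = purchases.loc['Cameron', 'tacos']
print(cameron_tacos)    # 3
```

Selecting a single row returns a `Series`, while selecting a list of rows returns a smaller `DataFrame`.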

---

## Reading Data into a `DataFrame` Object