# Introduction to Pandas


Now that you're comfortable with Python, we're going to move straight into a specific Python library that is hugely popular with data scientists and analysts around the world.

### What is it (a high-level overview)?

You can probably guess what it's called from the title (HINT: it's called Pandas). Pandas is powerful because it lets you work with data without writing a bunch of conditionals and loops like the ones you learned about earlier. Instead, Pandas reads data into objects that are easier to work with!

Some of the features of Pandas in an overview include:

* Types of labeled arrays, the main ones being Series (1-dim arrays) and DataFrame (2-dim arrays); the separate TimeSeries type from older releases is now simply a Series with a datetime index
* Index objects allowing for single and multi-axes indexing
* Ability to append and transform datasets / data input fairly easily
* Date range generation and custom date offsets
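* Input/Output tools: loading data from CSVs or other flat files into tabular objects (DataFrames), plus reading and writing HDF5 files through the PyTables library
* Rolling mean, rolling standard deviation, etc. computed over moving windows
* Static and rolling regression + analysis (built into older Pandas releases; in current versions this is handled by companion libraries such as statsmodels)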
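
To make a couple of the features above concrete, here is a minimal sketch of date range generation and a rolling mean. The readings are made-up numbers and the names `dates`, `readings`, and `rolling_mean` are just for illustration.

```python
import pandas as pd

# Hypothetical daily readings, indexed by a generated date range.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], index=dates)

# Rolling mean over a 3-day window; the first two entries are NaN
# because a full window of 3 values is not available yet.
rolling_mean = readings.rolling(window=3).mean()
print(rolling_mean)
```

Notice that no loop is written anywhere: the window logic is handled by the `rolling` call.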

## Let's get started!
<img src="https://cdn-images-1.medium.com/max/1200/1*tiFm2E0nCXp4Bc1Rk8OhdA.jpeg" width="300" height="300">


## Imports

We're going to start by importing the libraries we need to practice using Pandas. Besides Pandas itself, we import NumPy and Matplotlib: [NumPy](http://www.numpy.org/) is a Python library used here for its powerful array objects, which make matrix math easy, while [Matplotlib](https://matplotlib.org/) is used to visualize our data, raw and / or modified. You can click on the links to learn more about them, but we're not going to go into details for now.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Structure Introduction

### Series - object "wrapper" of sorts for a 1D array

In [None]:
series = pd.Series([1.0, 3.0, 5.0, 7.0, 9.0])
series

In [None]:
series[2] #Indexing as we would a typical list / array in vanilla Python

In [None]:
series = pd.Series(np.arange(4), index=['one','two','three','four']) # can have values and indices as well...
series.values

In [None]:
series.index 

In [None]:
series = pd.Series({'one':1,'two':2,'three':3,'four':4}) #... like a dictionary!
series.values

In [None]:
series.index
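
Since a Series carries both labels and positions, it can help to see the two lookup styles side by side. This is a small sketch that reuses the dictionary-built Series from above; `by_label`, `by_position`, and `doubled` are illustrative names.

```python
import pandas as pd

series = pd.Series({'one': 1, 'two': 2, 'three': 3, 'four': 4})

by_label = series['three']      # look up by index label
by_position = series.iloc[2]    # look up by integer position
doubled = series * 2            # arithmetic applies element-wise, no loop needed

print(by_label, by_position)
print(doubled)
```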

### DataFrame - object "wrapper" of sorts for a 2D array

In [None]:
df = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [None]:
df_area = df['area'] #accessing like before...
df_area

In [None]:
df.info() #giving us some information...

In [None]:
df["population per area in millions"] = df['population'] / df['area'] #You can create new columns
df["population per area in millions"]

In [None]:
#QUESTION: What if we wanted to get a series from a DataFrame?
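
One possible way to approach the question above, sketched on a trimmed-down copy of the countries table (the variable names here are only for illustration): selecting a single column or a single row gives back a Series, while selecting a list of columns keeps a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'country': ['Belgium', 'France', 'Germany'],
                   'population': [11.3, 64.3, 81.3]})

column_series = df['population']    # a single column is a Series
row_series = df.iloc[0]             # a single row is also a Series, indexed by column name
still_a_frame = df[['population']]  # a list of columns stays a DataFrame

print(type(column_series), type(row_series), type(still_a_frame))
```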

## Inputting Data

As mentioned earlier, we are able to input data from various sources - one type of file is the CSV file.
Let's read in a CSV file containing data about different airlines into a DataFrame - we'll talk more about what that means later.

In [None]:
#airlines_df = pd.read_csv('airlines.csv')

#Go to https://think.cs.vt.edu/corgis/python/airlines/airlines.html 
#and download the csv file to your Drive before you start this

from google.colab import drive 
drive.mount('/content/gdrive')

In [None]:
airlines_df = pd.read_csv('gdrive/My Drive/airlines.csv')
airlines_df
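
If you don't have the file handy, `pd.read_csv` can also be tried on text held in memory. The sketch below uses a tiny made-up CSV (the column names and numbers are hypothetical, not taken from the airlines file) wrapped in `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """code,month,cancelled
ATL,June,216
BOS,June,138
ATL,July,179
"""

sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)    # (rows, columns)
print(sample_df.dtypes)   # column types inferred while reading
```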

## Data Overview

We can get an overview of the dataset and its statistics using the built-in methods described (and available for you to try out) below...

In [None]:
print("head(num_rows): Printing first 5 rows of dataset...")
airlines_df.head() #5 is default

In [None]:
print("tail(num_rows): Printing last 3 rows of dataset...")
airlines_df.tail(3)

In [None]:
print("describe(): A statistical summary of the dataset...")
airlines_df.describe()

In [None]:
print("columns: lists the columns within the dataset...")
airlines_df.columns

In [None]:
print("index: lists the indices of the dataset...")
airlines_df.index

## Data Selection / "Slicing"

If we want to look at a subset of our dataset (say, only certain columns, a few rows, or some combination of the two), we can easily look at a specific "slice" of it using what was taught before about array accessing and slicing. If you don't remember, don't worry! The comments below give brief but informative explanations of what is going on.

### Rows

In [None]:
airlines_df[2:5] #slice taking rows 2 through 5-1 (=4)

In [None]:
airlines_df[6:] #slice taking rows 6 through end (4407, to be exact)

##### Rows - by location

In [None]:
airlines_df.iloc[np.r_[2:5, 7:len(airlines_df)]] #rows 2-4 plus rows 7 to the end, by position; slices can't go inside a list, so np.r_ joins the ranges

In [None]:
airlines_df.iloc[[2,5,9]] #What do you think this means? Hint - we are accessing specific rows, not ranges.
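
The difference between selecting by position and selecting by label is easier to see when the index is not just 0, 1, 2, ... Here is a small sketch on a made-up frame indexed by country name; note that position slices exclude the end, while label slices include it.

```python
import pandas as pd

df = pd.DataFrame({'population': [11.3, 64.3, 81.3, 16.9]},
                  index=['Belgium', 'France', 'Germany', 'Netherlands'])

first_two = df.iloc[0:2]               # positions 0 and 1; the end position is excluded
by_label = df.loc['France':'Germany']  # label slice; the end label is included
picked = df.iloc[[0, 3]]               # a list picks specific positions, not a range

print(first_two)
print(by_label)
print(picked)
```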

### Columns

In [None]:
airlines_df['# of Delays.Late Aircraft'] #slice taking column '# of Delays.Late Aircraft'

##### Multiple columns by label

In [None]:
airlines_df.loc[:,['# of Delays.Late Aircraft','Month Name']]

In [None]:
#Write your own in here!!

### Boolean / Where

In [None]:
#Using where: the result keeps the original shape, and rows that fail the condition are filled with NaN
air_df_sub = airlines_df.where(airlines_df['# of Delays.Late Aircraft'] > 100)
air_df_sub
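
`where` and plain boolean indexing are easy to mix up, so here is a minimal side-by-side sketch on a small made-up frame (the `airport` and `late` columns are hypothetical). `where` keeps every row and blanks out the ones that fail the condition, while indexing with the mask drops them.

```python
import pandas as pd

df = pd.DataFrame({'airport': ['ATL', 'BOS', 'ORD'], 'late': [250, 80, 130]})
mask = df['late'] > 100       # boolean Series, one True/False per row

kept_shape = df.where(mask)   # same shape; rows failing the condition become NaN
filtered = df[mask]           # only rows passing the condition are kept

print(kept_shape)
print(filtered)
```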

## Setting Values / Editing Set

In [None]:
# Entire Column
airlines_df['# of Delays.Late Aircraft'] = 500
airlines_df['# of Delays.Late Aircraft']

In [None]:
# Renaming columns
airlines_df = airlines_df.rename(columns={"# of Delays.Late Aircraft": "DelayNumber"}) # rename returns a new DataFrame, so assign it back
airlines_df
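
A detail worth checking for yourself: `rename` does not change the DataFrame it is called on; it returns a renamed copy. The sketch below shows this on a tiny stand-in frame with made-up values.

```python
import pandas as pd

df = pd.DataFrame({'# of Delays.Late Aircraft': [250, 80]})

# The renamed result has to be captured in a variable to be kept.
renamed = df.rename(columns={'# of Delays.Late Aircraft': 'DelayNumber'})

print(list(df.columns))       # original column name is untouched
print(list(renamed.columns))  # new column name
```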

In [None]:
# Create a new dataframe from columns
col1 = "Cancelled"
col2 = "Name"
airlines_new_df = airlines_df[[col1, col2]]
airlines_new_df

## Data Visualization

We can use Matplotlib, a data visualization library, to help with plotting!

For more information:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html --- Check out the documentation here!

These are some general examples... How would you do this for the actual values in the airlines dataset?

In [None]:
# Plotting Lines
values = [[1, 2], [2, 5]]
df_example = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
df_example.plot(lw=2, colormap='jet', marker='.', markersize=10, 
         title='Video streaming dropout by category')

In [None]:
#Run a Scatter Plot on the Data Frame
df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],[6.4, 3.2, 1], [5.9, 3.0, 2]],columns=['length', 'width', 'species'])
df.plot.scatter(x='length',y='width',c='DarkBlue')

In [None]:
#Run a Bar Plot 
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df2.plot.bar();

In [None]:
df2.plot.bar(stacked=True)

In [None]:
# CHALLENGE - how about for the airlines dataset? what are some things that we could plot?

## NOTES

This was a very brief "introduction" to Pandas, and the best way to learn is through practice! So here are some resources to look at!

https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html -- General tutorials

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html -- More information on Plotting

https://www.dataquest.io/blog/pandas-cheat-sheet/ -- Pandas "Cheat Sheet"

https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html -- Visualizations



### Extra: Linear Regression