# Lab01. Pandas Introduction

## What Does Pandas Do?
![a.jpg](attachment:a.jpg)

## What is a Pandas Table Object?
![b.png](attachment:b.png)


# Import packages

In [None]:
# import packages

import pandas as pd

# Extra packages
import numpy as np
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting and styling

# jupyter notebook magic to display plots in output
%matplotlib inline

plt.rcParams['figure.figsize'] = (10,6) # make the plots bigger

# Part 1
### Simple creation and manipulation of Pandas objects
**Key Points:** Pandas has two main data types:
* Series (similar to NumPy arrays, but with an index)
* DataFrames (a table or spreadsheet with a Series in each column)

Older pandas versions also had Panels (a 3D version of DataFrame); they have since been removed from pandas.

### It is easy to create a DataFrame

### We use `pd.DataFrame(inputs)` and can pass almost any data type as an argument

**Function:** `pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)`

Input data can be a numpy ndarray (structured or homogeneous), dict, or DataFrame. 
Dict can contain Series, arrays, constants, or list-like objects as the values.
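
As a minimal sketch of the dict case described above, the example below mixes an array, a list, a Series, and a constant in one dict. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: each dict key becomes a column label
data = {
    "from_array": np.array([1.0, 2.0, 3.0]),   # NumPy array
    "from_list": ["x", "y", "z"],              # list-like object
    "from_series": pd.Series([10, 20, 30]),    # pandas Series
    "constant": 7,                             # scalar, repeated on every row
}

example_df = pd.DataFrame(data)
print(example_df)
print(example_df.dtypes)
```

All array-like values must have the same length; the scalar is broadcast to match.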

In [None]:
# Try it with an array
np.random.seed(0) # set seed for reproducibility

a1 = np.random.randn(3)  # randn already returns a NumPy array
a2 = np.random.randn(3)
a3 = np.random.randn(3)

print(a1)
print(a2)
print(a3)

In [None]:
# Create our first DataFrame w/ an np.array - it becomes a column
df0 = pd.DataFrame(a1)
print(type(df0))
df0

In [None]:
# DataFrame from list of np.arrays

df0 = pd.DataFrame([a1, a2, a3])
df0

# notice that there is no column label, only integer values,
# and the index is set automatically

In [None]:
# DataFrame from 2D np.array

ax = np.random.randn(9).reshape(3,3)
ax

In [None]:
df0 = pd.DataFrame(ax,columns=['rand_normal_1','Random Again','Third'],
                   index=[100,200,99]) # we can also assign columns and indices, sizes have to match
df0

In [None]:
# DataFrame from a Dictionary

dict1 = {'A':a1, 'B':a2}
df1 = pd.DataFrame(dict1) 
df1
# note that we now have columns without assignment

In [None]:
# We can easily add another column (just as you add values to a dictionary)
df1['C']=a3
df1

In [None]:
# We can add a list with strings and ints as a column 
df1['L'] = ["Something", 3, "words"]
df1

# Pandas Series object
### Like an np.array, but we can combine data types and it has its own index
Note: Every column in a DataFrame is a Series
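
A short sketch of these two points, using made-up values: a Series can hold mixed types under its own index, and selecting a single DataFrame column returns a Series. The names `s` and `frame` are hypothetical.

```python
import pandas as pd

# Hypothetical values with a custom (non-integer) index
s = pd.Series([1.5, "text", 3], index=["a", "b", "c"])

print(s.dtype)      # mixed types are stored with the generic 'object' dtype
print(s["b"])       # look up by index label
print(s.iloc[0])    # look up by integer position

# Selecting one column of a DataFrame returns a Series
frame = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]})
print(type(frame["x"]))
```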

In [None]:
print(df1['L'])

In [None]:
print(type(df1['L']))

In [None]:
df1

In [None]:
# We can also rename columns
df1 = df1.rename(columns = {'L':'Renamed'})
df1

In [None]:
# We can delete columns
del df1['C']
df1

In [None]:
# or drop columns
df1.drop('A',axis=1) # does not change df1 if we don't set inplace=True

In [None]:
df1

In [None]:
# or drop rows
df1.drop(0)
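
The cells above rely on `drop` returning a new DataFrame rather than modifying the original. A minimal sketch with a hypothetical table `t` makes that explicit; reassigning the result is one common way to keep the change.

```python
import pandas as pd

# Hypothetical small table
t = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

without_a = t.drop("A", axis=1)   # new DataFrame; t is unchanged
without_row0 = t.drop(0)          # axis=0 is the default: drops the row labelled 0

print(list(t.columns))            # t still has both columns
t = t.drop(columns="B")           # reassign to keep the result
print(list(t.columns))
```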

In [None]:
# Example: view only one column
df1['B']

In [None]:
# Or view several columns
df1[['B','Renamed']]

# Other ways of slicing
In the "10 minutes to pandas" guide, you will see many ways to view and slice a DataFrame:

* view/slice by rows, e.g. `df[1:3]`, etc.

* view by integer position, see `df.iloc` (iloc)

* view by ranges of labels, e.g. index label 2 to 5, or dates Feb 3 to Feb 25, see `df.loc` (loc)
 
* view a single row by its index label, see `df.xs` (xs); the older `df.ix` indexer has been removed from pandas, so use `df.loc` or `df.iloc` instead

* filtering rows that have certain conditions
* add column
* add row

* How to change the index

and more...
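
The operations listed above can be sketched on a small table. The DataFrame `prices`, its labels, and its values are hypothetical; only standard pandas indexers are used.

```python
import pandas as pd

# Hypothetical table with a non-default index
prices = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0, 13.0], "close": [10.5, 10.8, 12.4, 13.1]},
    index=["mon", "tue", "wed", "thu"],
)

rows_by_position = prices[1:3]                      # row slice, end excluded
one_cell = prices.iloc[0, 1]                        # row 0, column 1 by position
by_label = prices.loc["tue":"wed", "open"]          # label slice, both ends included
one_row = prices.xs("thu")                          # single row as a Series
rising = prices[prices["close"] > prices["open"]]   # filter rows by a condition

prices["change"] = prices["close"] - prices["open"]  # add a column
prices.loc["fri"] = [14.0, 14.2, 0.2]                 # add a row by label
reindexed = prices.reset_index().set_index("open")    # change the index
```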

In [None]:
print(df1[0:2])  # first two rows

In [None]:
df1.iloc[1,1]

In [None]:
df1

# Part 2
## Finance example: Large Data Frames

### Now, let's get some data in CSV format.


In [None]:
# A CSV file is a comma-separated values file
# pd.read_csv also accepts URLs that host CSV files
def load_csv_data(file):
    return pd.read_csv(file).drop('Unnamed: 0',axis=1)

dfg = load_csv_data('data/google.csv') # Google stock data
dfa = load_csv_data('data/apple.csv') # Apple stock data

In [None]:
dfg.head() # show first five rows

In [None]:
dfg.tail(3) # last three

In [None]:
dfg.columns # returns columns, can be used to loop over

In [None]:
dfg.index # returns the row index

# Convert the index to pandas datetime object

In [None]:
dfg['Date'][0]

In [None]:
type(dfg['Date'][0])

In [None]:
dfg.index = pd.to_datetime(dfg['Date']) # set index

In [None]:
dfg.drop(['Date'],axis=1,inplace=True)

In [None]:
dfg.head()

In [None]:
print(type(dfg.index[0]))
dfg.index[0]

In [None]:
dfg.index

In [None]:
dfg.loc['2018'] # all rows from 2018; plain dfg['2018'] no longer selects rows in pandas 2.0+
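
Once the index holds datetimes, rows can be selected with a partial date string. The sketch below uses hypothetical data and `.loc`, which works across pandas versions; row selection with plain `[]` and a date string was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical daily closes spanning a year boundary
raw = pd.DataFrame({
    "Date": ["2017-12-28", "2017-12-29", "2018-01-02", "2018-01-03"],
    "Close": [100.0, 101.0, 102.0, 103.0],
})

raw["Date"] = pd.to_datetime(raw["Date"])   # strings -> Timestamps
ts = raw.set_index("Date")                  # Date column becomes the index

only_2018 = ts.loc["2018"]                  # all rows in the year 2018
december = ts.loc["2017-12"]                # all rows in December 2017
```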

# Attributes & general statistics of a Pandas DataFrame

In [None]:
dfg.shape # 251 business days last year

In [None]:
dfg.columns

In [None]:
dfg.size

In [None]:
# Some general statistics

dfg.describe()

In [None]:
# Boolean indexing
dfg['Open'][dfg['Open']>1130] # dates on which the opening price was above 1130

In [None]:
# Show where Open, High, Low and Close were greater than 1000 (other values become NaN)
dfg[dfg>1000].drop('Volume',axis=1).head(3)

In [None]:
# If you want the values in an np array
dfg.values

## .loc[]

In [None]:
# Getting a cross section with .loc - BY LABELS of the index and columns
# df.loc[a:b, x:y], by row and column labels; both end points are included

# Note: You have to know the index and column labels

dfg.loc['2017-08-31':'2017-08-21','Open':'Low']

## .iloc[]

In [None]:
dfg.columns

In [None]:
# .iloc slicing at specific location - BY POSITION in the table
# Recall:
# dfg[a:b] by rows
# dfg[[col]] or dfg[[col1, col2]] by columns
# dfg.loc[a:b, x:y], by index and column labels
# dfg.iloc[3:5,0:2], by integer position in the table

dfg.iloc[1:4,3:5] # 2nd to 4th row, 4th to 5th column
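
The difference between the two indexers is easiest to see side by side. In this sketch (hypothetical table `g`), a `.loc` slice includes both end labels, while an `.iloc` slice excludes the end position, like ordinary Python slicing.

```python
import pandas as pd

# Hypothetical table with string row labels
g = pd.DataFrame(
    {"Open": [1, 2, 3, 4], "High": [5, 6, 7, 8], "Low": [9, 10, 11, 12]},
    index=["r0", "r1", "r2", "r3"],
)

by_label = g.loc["r1":"r3", "Open":"High"]   # labels, both ends included
by_position = g.iloc[1:3, 0:2]               # positions, end excluded

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 2)
```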

### More Basic Statistics

In [None]:
# We can change the index sorting
dfg.sort_index(axis=0, ascending=True).head() # starts a year ago

In [None]:
# sort by value
dfg.sort_values(by='Open')[0:10]

# Boolean indexing

In [None]:
dfg[dfg>1115].head(10)

In [None]:
# we can also drop every row that contains a NaN value
dfg[dfg>1115].head(10).dropna()
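
A small sketch of what the mask does, using a hypothetical table `p`: comparing a DataFrame with a number gives a True/False table, indexing with it turns failing values into NaN, and `dropna` then removes rows depending on its `how` argument.

```python
import pandas as pd

# Hypothetical prices
p = pd.DataFrame({"Open": [90, 120, 130], "Close": [95, 90, 140]})

mask = p > 100                           # DataFrame of True/False, same shape as p
masked = p[mask]                         # values failing the condition become NaN
any_dropped = masked.dropna()            # drop rows with at least one NaN
all_dropped = masked.dropna(how="all")   # drop rows where every value is NaN

# Conditions on individual columns are combined with & (and) or | (or)
rows = p[(p["Open"] > 100) & (p["Close"] > 100)]
```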

In [None]:
dfg2 = dfg.copy() # make a copy and not a view
dfg2 is dfg
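
To illustrate why `.copy()` matters, the sketch below (hypothetical names `original`, `alias`, `clone`) contrasts plain assignment, which only creates a second name for the same object, with an independent copy.

```python
import pandas as pd

original = pd.DataFrame({"v": [1, 2, 3]})

alias = original            # second name for the same object
clone = original.copy()     # independent copy (deep by default)

clone.loc[0, "v"] = 99      # changes only the copy

print(alias is original)        # True
print(clone is original)        # False
print(original.loc[0, "v"])     # 1
```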

### Setting Values


In [None]:
# Recall
dfg.head(4)

In [None]:
# All the ways to view
# can also be used to set values
# good for data normalization

dfg['Volume'] = dfg['Volume']/100000.0
dfg.head(4)

### More Statistics and Operations

In [None]:
# mean by column, also try var() for variance
dfg.mean()   

In [None]:
dfg[0:5].mean(axis = 1) # row means of first five rows

# Plot Correlation
### Load several stocks

In [None]:
# Reload
dfg = load_csv_data('data/google.csv') # Google stock data
dfa = load_csv_data('data/apple.csv') # Apple stock data
dfm = load_csv_data('data/microsoft.csv') # Microsoft stock data
dfn = load_csv_data('data/nike.csv') # Nike stock data
dfb = load_csv_data('data/boeing.csv') # Boeing stock data

In [None]:
print(dfb.head())

In [None]:
# Rename columns
dfg = dfg.rename(columns = {'Close':'GOOG'})
#print (dfg.head())

dfa = dfa.rename(columns = {'Close':'AAPL'})
#print (dfa.head())

dfm = dfm.rename(columns = {'Close':'MSFT'})
#print (dfm.head())

dfn = dfn.rename(columns = {'Close':'NKE'})
#print (dfn.head())

dfb = dfb.rename(columns = {'Close':'BA'})

In [None]:
dfb.head(2)

In [None]:
# Let's merge some tables
# They will all merge on the common column Date

df = dfg[['Date','GOOG']].merge(dfa[['Date','AAPL']])
df = df.merge(dfm[['Date','MSFT']])
df = df.merge(dfn[['Date','NKE']])
df = df.merge(dfb[['Date','BA']])

df.head()
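
By default, `merge` joins on the columns the two tables share and keeps only keys present in both (an inner join). The sketch below uses two hypothetical tables with placeholder dates to show the default next to an outer join.

```python
import pandas as pd

# Hypothetical tables that share a 'Date' column
left = pd.DataFrame({"Date": ["d1", "d2", "d3"], "X": [1, 2, 3]})
right = pd.DataFrame({"Date": ["d2", "d3", "d4"], "Y": [20, 30, 40]})

inner = left.merge(right)                           # default: common columns, how='inner'
outer = left.merge(right, on="Date", how="outer")   # keep all dates, gaps become NaN

print(inner)
print(outer)
```

With stock data, an inner join means that a date missing from any one file is dropped from the merged table.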

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.head()

In [None]:
df.plot()

In [None]:
df.loc['2017', ['NKE','BA']].plot() # rows from 2017, two columns

In [None]:
# show a correlation matrix (Pearson)
crl = df.corr()
crl

In [None]:
crl.sort_values(by='GOOG',ascending=False)

In [None]:
s = crl.unstack()
so = s.sort_values(ascending=False)
so[so<1]
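
The `so<1` filter removes each column's correlation with itself, but it depends on a floating-point comparison and would also hide any genuinely perfect correlation. As an alternative sketch (hypothetical data in `d`), self-pairs can be dropped by comparing the two index labels instead.

```python
import pandas as pd

# Hypothetical series: 'b' moves with 'a', 'c' moves against it
d = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})

corr = d.corr()                     # Pearson by default
pairs = corr.unstack()              # Series indexed by (column, row) label pairs
not_self = [x != y for x, y in pairs.index]
pairs = pairs[not_self]             # drop self-pairs by label
ranked = pairs.sort_values(ascending=False)
print(ranked)
```

Each pair still appears twice, once as (a, b) and once as (b, a), because the correlation matrix is symmetric.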

In [None]:
sim = df - df.mean() # subtract each column's mean so the series are centered at zero
sim[['MSFT','BA']].plot()

# Part 3
## Exercises

### Now, let's do some related exercises.

Open a new Notebook called Lab01_ex. Write Python code in each cell to answer each question.

Write each program in its own cell.

Q1. Load 'data/exam-scores.csv' into a pandas DataFrame called score. Note: since this data set has headers, you can omit the `header=None` and `names=` arguments.

Q2. Write a program to determine if there is a correlation between a Student's score and the time it took them to complete the examination. Is there such a correlation?

Q3. Filter the scores for exam version 'D' only. What does the correlation look like now? Is it different? How so?

Q4. Let's examine the relationship between students who made their own study guide and their exam score. Create a variable study containing the columns 'Made Own Study Guide' and 'Student Score', and remove the rows with '?' in the Made Own Study Guide column. The output should look like this: 
![study-guide.png](attachment:study-guide.png)

Q5. Next, we need to convert the Made Own Study Guide column to a numeric value. To do this, we will create a new Series and add it to the DataFrame. Hint: use a list comprehension to evaluate 'Y' or 'N' and convert them to 1 or 0 respectively. When you’re done, your study DataFrame should look like this: 
![study-guide-value.png](attachment:study-guide-value.png)

Q6. What is the correlation between 'StudyGuideValue' and the exam scores? Plot a scatter with 'StudyGuideValue' on the x axis, and include a screenshot.