# NumPy and Pandas for One Dimensional Data  

## Overview
### **NumPy (Numerical Python)**  
- Its one-dimensional data structure is called an array  
- Simpler  

### **Pandas**  
- Its one-dimensional data structure is called a Series  
- More features  
- Built on NumPy arrays

### Similarities (NumPy arrays and Python lists)
- Access elements by position, e.g. a[0]  
- Access a range of elements, e.g. a[1:3]  
- Use loops, e.g. for x in a:  

### Differences (NumPy arrays vs. Python lists)
- NumPy array: each element should have the same type  
- NumPy includes a bunch of convenient functions, e.g. mean() and std()  
- NumPy arrays can be multi-dimensional  
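
The points above can be sketched in a few lines. The array contents and variable names here are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Similarities to a Python list
first = a[0]        # access by position
middle = a[1:3]     # access a range of elements
total = 0
for x in a:         # plain for loop
    total += x

# Differences from a Python list
mixed = np.array([1, 2.5, 3])       # ints are upcast so all elements share one dtype
average = a.mean()                  # convenience function
grid = np.array([[1, 2], [3, 4]])   # two-dimensional array
```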

---

## NumPy Array

NumPy Arrays are like souped-up Python lists

### Vectorized Operations  
A vector is a list of numbers.  
NumPy supports vectorized operations between  
- two NumPy arrays, or  
- one NumPy array and a single number (a scalar)  

**Vector Addition**
- NumPy: When you add two NumPy arrays, the elements at matching positions are added together
- Python: When you add two lists together, you get list concatenation (a new list with all elements of the 1st list followed by all elements of the 2nd list) 

**Multiplying by a Scalar**  
- NumPy arrays: Each element of the array is multiplied by the scalar, e.g. by 3  
- Python: Multiplying a list by 3 creates a list with all of the values of the original list repeated 3 times
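
A small side-by-side sketch of the two behaviours, using made-up values:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

array_sum = a + b        # element-wise addition
array_times_3 = a * 3    # every element multiplied by 3

list_a = [1, 2, 3]
list_b = [4, 5, 6]

list_sum = list_a + list_b    # concatenation, not addition
list_times_3 = list_a * 3     # repetition, not multiplication
```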

### **More Vectorized Operations**   

**Math Operations**  
Add, Subtract, Multiply, Divide, Exponentiate    

**Logical Operations**  
Make sure your arrays contain booleans; on integer arrays these operators act bitwise  
And & , Or | , Not ~    

**Comparison Operations**  
Greater: >  
Greater or equal: >=  
Less: <  
Less or equal: <=  
Equal: ==    
Not equal: !=   
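
As a rough illustration, the sketch below applies one operation from each group to two small made-up arrays. Comparisons return boolean arrays, which can then be combined with the logical operators.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([2, 2, 2, 2])

# Math operations work element by element
powers = a ** b
quotient = a / b

# Comparison operations return arrays of booleans
greater = a > b
equal = a == b

# Logical operations combine boolean arrays element by element
greater_and_not_equal = greater & ~equal
greater_or_equal = greater | equal
```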

### Slicing

Slicing a NumPy array creates a view of the original array, not a copy. Any change made through the slice is therefore reflected in the original array.  
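
A minimal sketch of the difference, with made-up values. A list slice is shown for contrast, and `.copy()` is one way to get an independent array when the view behaviour is not wanted.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
view = a[:3]          # a view onto the first three elements of a
view[0] = 100         # also changes a[0]

lst = [1, 2, 3, 4, 5]
lst_slice = lst[:3]   # list slices are new lists
lst_slice[0] = 100    # lst itself is unchanged

independent = a[:3].copy()   # explicit copy, detached from a
independent[1] = -1          # a[1] is unchanged
```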



---

## Pandas Series  
A series is similar to a NumPy array, but with extra functionality, e.g. s.describe()

All the things we just learned how to do on NumPy arrays will also work on Pandas Series  
- Accessing elements: s[0], s[3:7] (with the default index; s.iloc[0] is the explicit way to access by position)  
- Looping: for x in s  
- Convenient functions, e.g. s.mean(), s.max()    
- Vectorized operations: s1 + s2    
- Implemented in C (fast!)    

A Pandas Series is like a cross between a list and a dictionary  
- lists are stored in order and can be accessed by position
- dictionaries have keys and values, and you can look up values by key  

.iloc accesses elements by position  
.loc lets you look up values by index label  

If you create a Series without specifying an index, Pandas automatically assigns one starting at 0.
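
A short sketch of both access styles and of the default index. The values and labels are invented for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_position = s.iloc[0]    # first element, regardless of its label
by_label = s.loc['b']      # element whose index label is 'b'

summary = s.describe()     # count, mean, std, min, quartiles, max

default = pd.Series([10, 20, 30])      # no index given
default_labels = list(default.index)   # labels assigned automatically
```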

### Vectorized Operations and Series Indexes  
When you add two NumPy arrays, you're adding by position  

When you add two Pandas Series, you're adding by index label (not position)
- the indexes can be in a different order
- if an index label is present in one Series but not the other, the result for that label is NaN
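
To illustrate, here is a sketch with two made-up Series whose indexes overlap only partly and are ordered differently. The `fill_value` argument of `Series.add` is one option when missing labels should be treated as 0 rather than producing NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are matched by label: 'b' and 'c' appear in both Series,
# while 'a' and 'd' appear in only one and become NaN
total = s1 + s2

# Treat a label missing from one Series as 0 instead
total_filled = s1.add(s2, fill_value=0)
```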

### For Non Built-In Calculations  
1. Treat the Series as a list (for loops, etc.)
2. Use the method apply()  
apply() is called on a Series, takes a function, and returns a new Series with the function applied to each element  

Similar to the Python function map(), but apply() works on a Series instead of a list  
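
A minimal sketch of `apply()` next to `map()`. The helper `reverse_name` and the sample names are hypothetical, chosen only because no built-in vectorized operation does this directly.

```python
import pandas as pd

def reverse_name(name):
    """Turn 'First Last' into 'Last, First' (hypothetical helper)."""
    first, last = name.split(' ')
    return f'{last}, {first}'

names = pd.Series(['Ada Lovelace', 'Alan Turing'])

# apply() calls the function on each element and returns a new Series
reversed_series = names.apply(reverse_name)

# map() does the same job for a plain Python list
reversed_list = list(map(reverse_name, ['Ada Lovelace', 'Alan Turing']))
```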

---


# NumPy and Pandas for Two Dimensional Data
  
### **Python:** List of lists    
- accessing elements a[1][3]  

### **NumPy:** 2D Array
- more memory efficient than lists of lists
- accessing elements is a bit different, e.g. a[1, 3] 
- mean(), std(), etc. operate on the entire array as a whole by default
    - To do operations along an axis (rows or columns), pass the axis argument  
        - ridership.mean(axis = 0) computes the mean of each column (collapsing the rows)
        - ridership.mean(axis = 1) computes the mean of each row (collapsing the columns) 
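
The axis argument is easy to mix up, so here is a small sketch. The `ridership` values are invented; the only assumption is that rows are days and columns are stations.

```python
import numpy as np

# Hypothetical data: 2 days (rows) x 3 stations (columns)
ridership = np.array([[0, 10, 20],
                      [30, 40, 50]])

element = ridership[1, 2]             # row 1, column 2

overall = ridership.mean()            # one number for the whole array
per_column = ridership.mean(axis=0)   # collapses the rows: one mean per column
per_row = ridership.mean(axis=1)      # collapses the columns: one mean per row
```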
 
 
### **Pandas:** DataFrame 
- preferred because it has more functionality  
- has indexes similar to a Pandas Series
    - Can access elements using both the row index and the column name, e.g. ridership_df.loc['05-04-11', 'R004']
- great data structure for representing a CSV file because it is a 2D data structure in which each column can have a different type  
- because each column may have a different type, operations like .mean() are performed on each column separately by default
- DataFrame vectorized operations are similar to vectorized operations for 2D NumPy arrays  
    - But they match up elements by index and column name rather than by position  
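
A sketch of these DataFrame behaviours on a tiny made-up table. The dates, station names, and values are hypothetical and are not taken from the subway data set.

```python
import pandas as pd

# Hypothetical ridership table: rows are dates, columns are stations
ridership_df = pd.DataFrame(
    {'R003': [0, 10], 'R004': [20, 30]},
    index=['05-01-11', '05-02-11'],
)

by_label = ridership_df.loc['05-02-11', 'R004']   # row label, column name
by_position = ridership_df.iloc[1, 1]             # same element by position

column_means = ridership_df.mean()                # one mean per column

# Same labels, but rows and columns in a different order
other = pd.DataFrame(
    {'R004': [1, 1], 'R003': [2, 2]},
    index=['05-02-11', '05-01-11'],
)
total = ridership_df + other   # matched by index and column name
```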

```python
import pandas as pd
subway_df = pd.read_csv('nyc_subway_weather.csv')
```

```python
subway_df.head()
```

|   | UNIT | DATEn | TIMEn | ENTRIESn | EXITSn | ENTRIESn_hourly | EXITSn_hourly | datetime | hour | day_week | ... | pressurei | rain | tempi | wspdi | meanprecipi | meanpressurei | meantempi | meanwspdi | weather_lat | weather_lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R003 | 05-01-11 | 00:00:00 | 4388333 | 2911002 | 0.0 | 0.0 | 2011-05-01 00:00:00 | 0 | 6 | ... | 30.22 | 0 | 55.9 | 3.5 | 0.0 | 30.258 | 55.98 | 7.86 | 40.700348 | -73.887177 |
| 1 | R003 | 05-01-11 | 04:00:00 | 4388333 | 2911002 | 0.0 | 0.0 | 2011-05-01 04:00:00 | 4 | 6 | ... | 30.25 | 0 | 52.0 | 3.5 | 0.0 | 30.258 | 55.98 | 7.86 | 40.700348 | -73.887177 |
| 2 | R003 | 05-01-11 | 12:00:00 | 4388333 | 2911002 | 0.0 | 0.0 | 2011-05-01 12:00:00 | 12 | 6 | ... | 30.28 | 0 | 62.1 | 6.9 | 0.0 | 30.258 | 55.98 | 7.86 | 40.700348 | -73.887177 |
| 3 | R003 | 05-01-11 | 16:00:00 | 4388333 | 2911002 | 0.0 | 0.0 | 2011-05-01 16:00:00 | 16 | 6 | ... | 30.26 | 0 | 57.9 | 15.0 | 0.0 | 30.258 | 55.98 | 7.86 | 40.700348 | -73.887177 |
| 4 | R003 | 05-01-11 | 20:00:00 | 4388333 | 2911002 | 0.0 | 0.0 | 2011-05-01 20:00:00 | 20 | 6 | ... | 30.28 | 0 | 52.0 | 10.4 | 0.0 | 30.258 | 55.98 | 7.86 | 40.700348 | -73.887177 |


```python
subway_df.describe().transpose()
```

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ENTRIESn | 42649.0 | 28124860.0 | 30436070.0 | 0.0 | 10397620.0 | 18183890.0 | 32630490.0 | 235774600.0 |
| EXITSn | 42649.0 | 19869930.0 | 20289860.0 | 0.0 | 7613712.0 | 13316090.0 | 23937710.0 | 149378200.0 |
| ENTRIESn_hourly | 42649.0 | 1886.59 | 2952.386 | 0.0 | 274.0 | 905.0 | 2255.0 | 32814.0 |
| EXITSn_hourly | 42649.0 | 1361.488 | 2183.845 | 0.0 | 237.0 | 664.0 | 1537.0 | 34828.0 |
| hour | 42649.0 | 10.04675 | 6.938928 | 0.0 | 4.0 | 12.0 | 16.0 | 20.0 |
| day_week | 42649.0 | 2.905719 | 2.079231 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 |
| weekday | 42649.0 | 0.7144364 | 0.4516877 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| latitude | 42649.0 | 40.72465 | 0.07164979 | 40.576152 | 40.67711 | 40.71724 | 40.75912 | 40.88918 |
| longitude | 42649.0 | -73.94036 | 0.0597126 | -74.073622 | -73.98734 | -73.95346 | -73.90773 | -73.75538 |
| fog | 42649.0 | 0.00982438 | 0.09863108 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


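```python
# Calculate Pearson's r between two Series
def correlation(x, y):
    # Standardize each variable; ddof=0 gives the population standard
    # deviation (pandas defaults to ddof=1)
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)

    # Pearson's r is the mean of the products of the standardized values
    return (std_x * std_y).mean()
```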

```python
correlation(subway_df['ENTRIESn_hourly'], subway_df['meanprecipi'])
```

```
0.03564851577223041
```

```python
correlation(subway_df['ENTRIESn_hourly'], subway_df['ENTRIESn'])
```

```
0.5858954707662182
```
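
As a sanity check, the hand-written `correlation` function can be compared with the built-in alternatives `np.corrcoef` and `Series.corr`. The sketch below repeats the function so it runs on its own and uses small made-up Series rather than the subway data.

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Same definition as above: mean of the products of standardized values
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()

# Made-up sample data
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

r_manual = correlation(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
r_pandas = x.corr(y)                # Pearson correlation by default

# A perfectly linear relationship should give r very close to 1
r_linear = correlation(x, 2 * x + 1)
```

All three values of r should agree up to floating-point rounding.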