# Topic 04: Python Libraries - NumPy and Pandas

## Overview

A library (or a module/package) is a pre-written piece of software that you can re-use rather than having to write that functionality yourself. So instead of having to write the code from scratch to plot a bar chart, you can use the Matplotlib library.

### NumPy 

In Python, the most fundamental package used for scientific computation is **NumPy** (Numerical Python). It provides lots of useful functionality for mathematical operations on vectors and matrices in Python. Matrix computation is the primary strength of NumPy. 


The library provides these mathematical operations through the NumPy **array** data type (`ndarray`), which typically runs much faster than equivalent code written with Python's built-in lists and loops.
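To make the difference between arrays and lists concrete, here is a minimal sketch (the variable names are illustrative, not from the lesson) showing that the same operator behaves differently on each type:

```python
import numpy as np

py_list = [1, 2, 3, 4]
np_arr = np.array(py_list)

# `+` concatenates lists, but adds element by element for arrays
list_sum = py_list + py_list
arr_sum = np_arr + np_arr

# doubling every value needs a loop or comprehension for a list ...
list_doubled = [x * 2 for x in py_list]
# ... but is a single vectorized expression for an array
arr_doubled = np_arr * 2

print(list_sum)      # [1, 2, 3, 4, 1, 2, 3, 4]
print(arr_sum)       # [2 4 6 8]
print(list_doubled)  # [2, 4, 6, 8]
print(arr_doubled)   # [2 4 6 8]
```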

### Pandas

Pandas is a Python package designed to work with “relational” (tabular) data. It replicates much of the functionality of relational databases in a simple and intuitive way. Pandas is a great tool for data wrangling: it is designed for quick and easy data cleansing, manipulation, aggregation, and visualization.


There are **two main data structures** in the library: 

1. “Series” - one-dimensional
2. “DataFrames” - two-dimensional
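As a small illustration of the two structures, the sketch below builds one of each from made-up data (`prices` and `fruit` are hypothetical names used only here):

```python
import pandas as pd

# a Series is one-dimensional: values plus an index of labels
prices = pd.Series([1.5, 2.0, 0.75], index=['apple', 'pear', 'plum'], name='price')

# a DataFrame is two-dimensional: labeled rows and named columns
fruit = pd.DataFrame({'price': [1.5, 2.0, 0.75], 'stock': [10, 0, 25]},
                     index=['apple', 'pear', 'plum'])

print(prices.ndim, prices.shape)  # 1 (3,)
print(fruit.ndim, fruit.shape)    # 2 (3, 2)

# each column of a DataFrame is itself a Series
print(type(fruit['price']))
```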

These data structures can be manipulated in a number of ways for analytical needs. Here are a few ways in which Pandas may come in handy:

* Easily add columns to and delete columns from a DataFrame
* Convert other data structures to DataFrame objects
* Handle missing data and outliers
* Group and aggregate data with powerful built-in functionality
* Plot statistical visualizations on the go
* Work with most other Python data libraries, since the Pandas data structures are highly compatible with them
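A few of the tasks in this list can be sketched on a tiny, made-up `sales` table (the data and column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south'],
    'units': [10, np.nan, 7, 3],
    'price': [2.0, 2.5, 2.0, 2.5],
})

# handle missing data: replace the missing unit count with 0
sales['units'] = sales['units'].fillna(0)

# add a column, then delete one
sales['revenue'] = sales['units'] * sales['price']
sales = sales.drop(columns=['price'])

# group and aggregate: total revenue per region
revenue_by_region = sales.groupby('region')['revenue'].sum()
print(revenue_by_region)
```

Whether filling missing values with 0 is appropriate depends on the data; it is used here only to keep the example short.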

## NumPy Arrays vs Python Lists

### Broadcasting

In [1]:
import numpy as np

numpy_arr = np.array([1, 2, 3, 4]) # you can simply coerce structured data into np.array()
print('Here is a NumPy array:', numpy_arr)
print('You know it is a NumPy array because its type is:', type(numpy_arr))

Here is a NumPy array: [1 2 3 4]
You know it is a NumPy array because its type is: <class 'numpy.ndarray'>


In [2]:
vector = np.array([1, 2, 3, 4, 5, 6])
matrix1 = np.array([[1, 2, 3], [4, 5, 6]])
matrix2 = np.array([[1, 2], [3, 4], [5, 6]])
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# tensor10 = np.array([[matrix1]])

print(vector)
print('vector shape:', vector.shape, '\n')
print(matrix1)
print('matrix1 shape:', matrix1.shape, '\n')
print(matrix2)
print('matrix2 shape:', matrix2.shape, '\n')
print(tensor)
print('tensor shape:', tensor.shape, '\n')

[1 2 3 4 5 6]
vector shape: (6,) 

[[1 2 3]
 [4 5 6]]
matrix1 shape: (2, 3) 

[[1 2]
 [3 4]
 [5 6]]
matrix2 shape: (3, 2) 

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
tensor shape: (2, 2, 2) 



In [3]:
# mess around with indexing a tensor

tensor[1]

array([[5, 6],
       [7, 8]])

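NumPy arrays allow for something known as **broadcasting**, which lets you perform element-wise operations on arrays with different shapes or different numbers of dimensions. NumPy compares the two `.shape` tuples from the last dimension backwards: each pair of dimensions must either be equal or one of them must be 1 (a missing leading dimension counts as 1). The smaller array is then treated as if it were repeated along those dimensions, without the data actually being copied. Order of the `.shape` tuple matters!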
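One way to reason about which shapes are compatible is to write the comparison out by hand. The helper below, `can_broadcast`, is a hypothetical function written for this note (it is not part of NumPy); it walks both shape tuples from the last dimension backwards:

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Return True if two shapes are compatible for broadcasting."""
    # compare dimensions from the last one backwards; zip stops at the
    # shorter shape, so missing leading dimensions are always accepted
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

print(can_broadcast((2, 3), (3,)))       # True
print(can_broadcast((3, 2), (3,)))       # False
print(can_broadcast((2, 2, 2), (2, 1)))  # True

# the real operation agrees with the first check
result = np.ones((2, 3)) + np.ones(3)
print(result.shape)                      # (2, 3)
```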

In [4]:
scalar = 4
print(vector)

print(vector + scalar)

[1 2 3 4 5 6]
[ 5  6  7  8  9 10]


In [7]:
v1 = np.array([1, 0, 1])
m1 = np.array([[1, 2], [3, 4], [5, 6]])
print(v1)
print(m1)
m1 + v1

[1 0 1]
[[1 2]
 [3 4]
 [5 6]]


ValueError: operands could not be broadcast together with shapes (3,2) (3,) 
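The addition fails because the last dimension of `m1` is 2 while `v1` has length 3. Two possible ways to make the shapes line up are sketched below; `v2` is a made-up vector added for illustration:

```python
import numpy as np

v1 = np.array([1, 0, 1])                 # shape (3,)
m1 = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)

# option 1: turn v1 into a column of shape (3, 1) so it stretches across both columns
col_sum = m1 + v1.reshape(3, 1)

# option 2: use a vector whose length matches the last dimension of m1
v2 = np.array([10, 20])                  # shape (2,)
row_sum = m1 + v2

print(col_sum)
print(row_sum)
```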

What shapes of matrices and vectors can be broadcast onto `tensor`?

In [9]:
print(tensor)

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]


In [None]:
# notes here: 
# 
#

### Creating Arrays
NumPy also has several built-in functions for creating arrays. These are particularly useful in practice:
* `np.zeros(shape)` 
* `np.ones(shape)`
* `np.full(shape, fill)`
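A quick sketch of the three functions, with arbitrary shapes and fill value chosen for illustration:

```python
import numpy as np

zeros = np.zeros((2, 3))     # 2x3 array of 0.0
ones = np.ones(4)            # length-4 vector of 1.0
sevens = np.full((2, 2), 7)  # 2x2 array where every entry is 7

print(zeros)
print(ones)
print(sevens)
```

Note that `shape` can be a single integer for a vector or a tuple for higher-dimensional arrays.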

## Pandas!

In [10]:
import pandas as pd

### Getting Data In/Out of Pandas

In [12]:
car_df = pd.read_csv('auto-mpg.csv')

In [13]:
car_df

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 387 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
| 388 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
| 389 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
| 390 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |


In [14]:
car_df.to_csv('cleaned_cars.csv', index=False) 
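The `index=False` argument matters when a file is read back in later. The sketch below uses an in-memory string instead of a file on disk, with a tiny made-up `cars` table, to show what happens with and without it:

```python
import io
import pandas as pd

cars = pd.DataFrame({'mpg': [18.0, 15.0], 'cylinders': [8, 8]})

# index=False leaves the row labels out of the CSV text
without_index = cars.to_csv(index=False)
# the default (index=True) writes them as an unnamed first column
with_index = cars.to_csv()

# reading the second version back produces an extra 'Unnamed: 0' column
reloaded_clean = pd.read_csv(io.StringIO(without_index))
reloaded_extra = pd.read_csv(io.StringIO(with_index))

print(list(reloaded_clean.columns))  # ['mpg', 'cylinders']
print(list(reloaded_extra.columns))  # ['Unnamed: 0', 'mpg', 'cylinders']
```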

#### Turning Nested Lists, Dictionaries to Pandas

In [15]:
spanish_numbers = ['uno', 'dos', 'tres', 'cuatro', 'cinco']

dictionary = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}

In [16]:
numbers = pd.DataFrame.from_dict(dictionary, orient='index') # 'index' makes the dictionary keys the row labels; the default, 'columns', makes them column names
numbers

| | 0 |
|---|---|
| one | 1 |
| two | 2 |
| three | 3 |
| four | 4 |
| five | 5 |
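To see what the `orient` argument changes, here is a small side-by-side sketch using a shortened version of the dictionary (`by_index` and `by_columns` are illustrative names):

```python
import pandas as pd

dictionary = {'one': 1, 'two': 2, 'three': 3}

# orient='index': keys become row labels; columns= names the single value column
by_index = pd.DataFrame.from_dict(dictionary, orient='index', columns=['numerical'])

# orient='columns' (the default): keys become column names,
# so each value has to be list-like
by_columns = pd.DataFrame.from_dict({k: [v] for k, v in dictionary.items()})

print(by_index.shape)    # (3, 1)
print(by_columns.shape)  # (1, 3)
```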


In [17]:
numbers.columns = ['numerical']
numbers

| | numerical |
|---|---|
| one | 1 |
| two | 2 |
| three | 3 |
| four | 4 |
| five | 5 |


In [18]:
# new columns: df[new column name] = new column data
numbers['spanish'] = spanish_numbers
numbers

| | numerical | spanish |
|---|---|---|
| one | 1 | uno |
| two | 2 | dos |
| three | 3 | tres |
| four | 4 | cuatro |
| five | 5 | cinco |


In [20]:
from sklearn.datasets import load_wine
data = load_wine() # loading a built-in dataset

df = pd.DataFrame(data.data, columns=data.feature_names)

In [21]:
df

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 |
| 174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 |
| 175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 |
| 176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 |


### DataFrame Attributes
You can think of attributes as *characteristics* of the DataFrame that are looked up rather than computed. Attributes don't end with `()`.
- `.index`
- `.columns`
- `.dtypes`
- `.shape`
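The four attributes can be tried on any DataFrame. The sketch below uses a three-row, hand-typed `wines` table (an assumption for illustration) so it runs without loading a dataset:

```python
import pandas as pd

wines = pd.DataFrame({'alcohol': [14.23, 13.20, 13.16],
                      'magnesium': [127, 100, 101]})

print(wines.index)    # the row labels
print(wines.columns)  # the column names
print(wines.dtypes)   # the data type of each column
print(wines.shape)    # (3, 2): rows first, then columns
```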


In [28]:
# try out all the Pandas things with the wine dataset!
df.index

RangeIndex(start=0, stop=178, step=1)

### Methods to keep in your back pocket:
- `.head()`
- `.tail()`
- `.info()`
- `.describe()`
- `.isna().sum()` (you can stack methods in one line!!)
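Here is a short sketch of these methods on a small, made-up `wines` table with one missing value, so that `.isna().sum()` has something to count:

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame({'alcohol': [14.23, 13.20, np.nan, 14.37],
                      'hue': [1.04, 1.05, 1.03, 0.86]})

print(wines.head(2))     # first 2 rows
print(wines.tail(1))     # last row
wines.info()             # prints column types and non-null counts
print(wines.describe())  # summary statistics for numeric columns

# chaining: isna() returns a boolean DataFrame, sum() counts the True values per column
missing = wines.isna().sum()
print(missing)
```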


In [37]:
df.isna().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

### Selecting
- `.iloc[rows, columns]` a Pandas DataFrame indexer used for integer-location based indexing / selection by position 
- `.loc[rows, columns]` has two use cases:
    - Selecting by label / index
    - Selecting with a boolean / conditional lookup
- Filtering a DataFrame
    - `.filter()` function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
    - Filtering using conditions
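The selection options listed above can be compared side by side. This sketch uses a four-row, hand-typed `wines` table (illustrative data only); note that `.iloc` slices exclude the end position while `.loc` slices include the end label:

```python
import pandas as pd

wines = pd.DataFrame({'alcohol': [14.23, 13.20, 13.16, 14.37],
                      'ash': [2.43, 2.14, 2.67, 2.50],
                      'alcalinity_of_ash': [15.6, 11.2, 18.6, 16.8]})

by_position = wines.iloc[:2, :2]                 # positions 0-1, end excluded
by_label = wines.loc[:2, ['ash', 'alcohol']]     # labels 0-2, end included
by_condition = wines.loc[wines['alcohol'] > 14]  # boolean lookup
ash_columns = wines.filter(like='ash')           # columns whose name contains 'ash'

print(by_position.shape)          # (2, 2)
print(by_label.shape)             # (3, 2)
print(list(by_condition.index))   # [0, 3]
print(list(ash_columns.columns))  # ['ash', 'alcalinity_of_ash']
```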

In [43]:
df.iloc[:10, :5]

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium |
|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 |
| 5 | 14.2 | 1.76 | 2.45 | 15.2 | 112.0 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96.0 |
| 7 | 14.06 | 2.15 | 2.61 | 17.6 | 121.0 |
| 8 | 14.83 | 1.64 | 2.17 | 14.0 | 97.0 |
| 9 | 13.86 | 1.35 | 2.27 | 16.0 | 98.0 |


In [45]:
df.columns

Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline'],
      dtype='object')

In [55]:
df.loc[:10, ['ash', 'hue', 'alcohol']]

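| | ash | hue | alcohol |
|---|---|---|---|
| 0 | 2.43 | 1.04 | 14.23 |
| 1 | 2.14 | 1.05 | 13.2 |
| 2 | 2.67 | 1.03 | 13.16 |
| 3 | 2.5 | 0.86 | 14.37 |
| 4 | 2.87 | 1.04 | 13.24 |
| 5 | 2.45 | 1.05 | 14.2 |
| 6 | 2.45 | 1.02 | 14.39 |
| 7 | 2.61 | 1.06 | 14.06 |
| 8 | 2.17 | 1.08 | 14.83 |
| 9 | 2.27 | 1.01 | 13.86 |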


In [54]:
df['alcohol'] > 14

0       True
1      False
2      False
3       True
4      False
       ...  
173    False
174    False
175    False
176    False
177     True
Name: alcohol, Length: 178, dtype: bool

In [57]:
df.loc[df['alcohol'] > 14]

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 5 | 14.2 | 1.76 | 2.45 | 15.2 | 112.0 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450.0 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96.0 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290.0 |
| 7 | 14.06 | 2.15 | 2.61 | 17.6 | 121.0 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295.0 |
| 8 | 14.83 | 1.64 | 2.17 | 14.0 | 97.0 | 2.8 | 2.98 | 0.29 | 1.98 | 5.2 | 1.08 | 2.85 | 1045.0 |
| 10 | 14.1 | 2.16 | 2.3 | 18.0 | 105.0 | 2.95 | 3.32 | 0.22 | 2.38 | 5.75 | 1.25 | 3.17 | 1510.0 |
| 11 | 14.12 | 1.48 | 2.32 | 16.8 | 95.0 | 2.2 | 2.43 | 0.26 | 1.57 | 5.0 | 1.17 | 2.82 | 1280.0 |
| 13 | 14.75 | 1.73 | 2.39 | 11.4 | 91.0 | 3.1 | 3.69 | 0.43 | 2.81 | 5.4 | 1.25 | 2.73 | 1150.0 |
| 14 | 14.38 | 1.87 | 2.38 | 12.0 | 102.0 | 3.3 | 3.64 | 0.29 | 2.96 | 7.5 | 1.2 | 3.0 | 1547.0 |
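Conditions can also be combined. The sketch below, on a small hand-typed `wines` table (illustrative data, not the full dataset), uses `&` for "and", `|` for "or", and `~` for "not"; each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

wines = pd.DataFrame({'alcohol': [14.23, 13.20, 13.16, 14.37],
                      'hue': [1.04, 1.05, 1.03, 0.86]})

# both conditions must hold
strong_and_bright = wines[(wines['alcohol'] > 14) & (wines['hue'] > 1.0)]
# at least one condition must hold
strong_or_dull = wines[(wines['alcohol'] > 14) | (wines['hue'] < 0.9)]
# negate a condition
not_strong = wines[~(wines['alcohol'] > 14)]

print(list(strong_and_bright.index))  # [0]
print(list(strong_or_dull.index))     # [0, 3]
print(list(not_strong.index))         # [1, 2]
```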


## Other Things:
- Lesson on Data Privacy and Ethics
- Explore Kaggle (from the Learn.co lesson `Kaggle and the Ames Housing Dataset`)


## Next Section:
- Using `lambda` with a Series/DataFrame
- Data cleaning techniques
- Data visualizations: we'll get to that in Section 06, but this is a good starting point if you want to work on the mini project https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html