<img src="./img/vi_logo.png" style="float: left; margin: 10px; height: 45px">

# Vertical Institute Data Science 101
# Lesson 3: Numpy and Pandas


---


### Learning Objectives

#### Part 1: Numpy
**After this lesson, you will be able to:**
- Use Python data science libraries to perform common data analytics tasks easily
- Use basic operations on ndarrays
- Index and iterate through numpy arrays

#### Part 2: Pandas
**After this lesson, you will be able to:**
- Understand series and dataframes
- Use functions such as unique, value_counts and mathematical operations
- Use basic operations of dataframes for data analysis
- Join tables using pandas

These are important Python packages, especially in financial services.

## Importing packages

By default, we can call Python's built-in functions, such as `print` or `type`.

To use other functions, we first need to import the appropriate libraries.

The Anaconda distribution comes with all the libraries we need already installed on your computer. 

So to use them in our notebook, we only have to load the library via `import <library>`.

Some of the core packages include:
- ***NumPy*** : Base N-dimensional array package
- ***Pandas*** : Data Analysis
- ***Matplotlib*** : Comprehensive plotting
- ***SciPy***: Used for scientific computing and technical computing

In [None]:
#import the package
import numpy as np

<a id='numpy'></a>

## Part 1: NumPy

- ***NumPy*** is the basic package for scientific computing in Python
- Pandas and Sklearn represent information using NumPy under the hood
- NumPy library is based on a main object ***ndarray*** which stands for __N-dimensional array__


### Ndarray 
    
- Designed for fast vectorized numerical computations
- The data type is specified by the dtype attribute
- The ***array()*** function is used to define a new ***ndarray*** by passing in a Python list containing the elements to be included.

## What is the difference between a list and an array?

Lists and Numpy arrays look very similar as containers of variables

In [None]:
# To create a list, we use square brackets
my_list=[10, 20, 30, 40]
my_list

In [None]:
# To create an array, we use the `np.array()` function and pass in a list
my_array = __._____(my_list)
my_array

In [None]:
## Similarities
print(my_list[0])
print(my_array[0])

### What are the differences?

1. Numpy vs python list: https://webcourses.ucf.edu/courses/1249560/pages/python-lists-vs-numpy-arrays-what-is-the-difference
2. How python variables work, their overhead contained: https://www.youtube.com/watch?v=0Om2gYU6clE&t=56s

- We start to utilize numpy arrays instead of lists as we start working with larger datasets
- `pandas` and `scikit-learn` use numpy under the hood

This efficiency comes with a strict rule. In a Numpy array, we can only put variables of the **same data type**.

Whereas in a list, we can add variables of different data types.
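
A small sketch of what this rule means in practice (the variable names are illustrative and not part of the lesson data): rather than rejecting mixed input, `np.array` converts every element to one common type.

```python
import numpy as np

# Integers and floats together: everything is upcast to float64
numbers = np.array([10, 2.0])
print(numbers.dtype)       # float64

# Numbers and text together: every element becomes a string
mixed = np.array([10, 2.0, "bank"])
print(mixed.dtype.kind)    # U (a Unicode string dtype)
print(mixed)
```

So a mixed array is still created, but the numbers inside it can no longer be used for arithmetic.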

In [None]:
my_list=[10, 2.0, "bank"] # this works
my_list

In [None]:
my_array=np.array([10, 2.0, "bank"]) # this runs too, but NumPy converts every element to a string
my_array

### Properties of a ndarray

#### 1-dimensional 

In [None]:
a = np.array([10, 20, 30, 40])

In [None]:
# print the array
print(_)

In [None]:
# print the type
print(____(a))

In [None]:
# let's print some numpy array attributes: values data type
print(a.dtype)

In [None]:
# let's print some numpy array attributes: shape
print(a.____)

#### 2-dimensional 

In [None]:
b = np.array([[1.2, 3.4],
              [0.1, 0.2]])

In [None]:
print(b) # print array
print(type(b)) # print variable type
print(b.dtype) # print the values data types
print(b.shape) # print the shape
print(b.ndim) # print the number of dimensions

<a id='ops_arr'></a>
### Basic Operations of ndarray object
- __Arithmetic operations__ can be performed with arithmetic operators between an array and a scalar or array
- Operations between arrays are ***element-wise***, which means operators are applied only between corresponding elements that occupy the same position
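
As a minimal sketch of element-wise behaviour (the arrays `x` and `y` are made up for illustration), subtraction and division pair up elements that sit in the same position:

```python
import numpy as np

x = np.array([8, 6, 4])
y = np.array([2, 3, 4])

# Array with array: elements in the same position are paired up
print(x - y)    # [6 3 0]
print(x / y)    # [4. 2. 1.]

# Array with scalar: the scalar is applied to every element
print(x - 1)    # [7 5 3]
```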

<img src="img/arithmetic.png" style="margin: 20px; height: 400px">

### Broadcasting and Vectorization

- 2 Rules of broadcasting with examples: https://numpy.org/doc/stable/user/basics.broadcasting.html#general-broadcasting-rules
- Lower dimensions can be broadcast up by more than one dimension
    - A 0D scalar can be broadcast to 1D, 2D, or any number of dimensions, depending on the other operand
    - A 1D array can be broadcast to 2D, 3D, and so on
    
- For data manipulation with pandas, you only need to know how scalars broadcast to 1D. More complex broadcasting matters mainly when implementing machine learning algorithms
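
The following sketch illustrates the broadcasting cases listed above, using hypothetical arrays `grid` and `row`. It also shows what happens when two shapes cannot be lined up.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)

# The 1D row is stretched across both rows of the 2D grid
print(grid + row)
# [[11 22 33]
#  [14 25 36]]

# A scalar is stretched to every position
print(grid + 100)
# [[101 102 103]
#  [104 105 106]]

# Shapes that cannot be lined up raise an error
try:
    grid + np.array([1, 2])         # (2, 3) with (2,) does not broadcast
except ValueError as err:
    print("ValueError:", err)
```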

In [None]:
a = np.array([10, 20, 30, 40])
print(a)

In [None]:
# multiply each value in array a by 4
print(______)

In [None]:
b = np.array([4, 5, 6, 7])
print(b)

In [None]:
# add a and b together element-wise
print(a + b)

In [None]:
# multiply a and b together element-wise
print(_______)

### Exercise
1. Given arrays a and b, create c using arithmetic operators
2. Do the same without using b this time, by making use of broadcasting

In [None]:
a = np.array([2, 3, 4])
b = np.array([1, 1, 1])
c = np.array([0, 1, 2])
print(a)
print(b)
print(c)

#### Solution 

### Mind the axis! 
- This applies to pandas later too
    - except that pandas only uses axis 0 (default) or 1, with no aggregation over both axes at once like numpy does
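
One way to picture the axis argument, sketched here with `max` on a made-up array `scores`: `axis=0` collapses the rows and leaves one result per column, while `axis=1` collapses the columns and leaves one result per row.

```python
import numpy as np

scores = np.array([[1, 5, 3],
                   [4, 2, 6]])      # 2 rows, 3 columns

print(scores.max())          # 6, over all six values
print(scores.max(axis=0))    # [4 5 6], one result per column
print(scores.max(axis=1))    # [5 6], one result per row
```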

In [None]:
axis_demo = np.array([[1, 2, 3],
                      [4, 5, 6]])
axis_demo

In [None]:
# consider all 6
axis_demo._____() 

In [None]:
# consider column by column
axis_demo.sum(______) 


In [None]:
# consider row by row
axis_demo.sum(_______) 

### How do we know all the functions within new libraries?

In [None]:
# Method 1: Tab completion

# Tab on the keyboard 

# Method 2: Documentation
# https://numpy.org/doc/1.18/

<a id='index'></a>
### Indexing and Iterating an array

Similar to how we indexed Python strings, lists, etc, array indexing allows us to extract a value, select items and assign new values to the array

***Indexing***

In [None]:
# Indexing a one-dimensional array
a = np.array([10, 20, 30, 40])
print(a)

In [None]:
# Get the 3rd value of the array


In [None]:
A = np.array([[10, 20, 30], 
              [40, 50, 60], 
              [70, 80, 90]])
print(A)

<img src="img/ndarray_index.png" style="margin: 20px; height: 400px">

#### Getting 1 value

- Syntax is `A[row_index, col_index]`

In [None]:
# get first row first column's value - row index 0, col index 0
print(A[0, 0])


In [None]:
# get second row second column's value - row index 1, col index 1
print(A[1, 1])


In [None]:
# You try
# get third row third column's value - row index 2, col index 2
print(A[_, _])

#### Getting list of values 
- syntax is `arr[row_index]`

In [None]:
# getting the first row - row index 0

print(A[0])
print(A[0, :]) # same output. The colon : means all columns if it is AFTER the comma

In [None]:
# getting the first column - column index 0

print(A[:, 0])  # The colon : means all rows if it is BEFORE the comma

In [None]:
# now you try getting the entire third column - column index ?what?
# fill in the blanks

print(A[_____])

### Slicing works on numpy arrays too 
- Syntax is `arr[row_slice, col_slice]` or `arr[start:stop:step, start:stop:step]`
- Remember slice syntax is `start:stop:step`
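
A short sketch of `start:stop:step` applied to rows and columns at the same time, using a hypothetical 4x4 array `M` rather than the lesson's `A`:

```python
import numpy as np

M = np.arange(16).reshape(4, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]
#  [12 13 14 15]]

print(M[1:3, 0:2])    # rows 1-2, columns 0-1 -> [[4 5] [8 9]]
print(M[::2, ::3])    # every 2nd row, every 3rd column -> [[0 3] [8 11]]
print(M[-1, ::-1])    # last row, reversed -> [15 14 13 12]
```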

In [None]:
# recreating the same 2-d array
A = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90]).reshape(3,3)
print(A)

In [None]:
# select row index 0 and 1, each row is a 1d array
print(A[:2])

In [None]:
# or 
print(A[0:2]) # both are the same. we can leave out the START if we are ok to start from the beginning.

In [None]:
# how about this?
print(A[::2]) # start from the first row, step 2 (skip) 

In [None]:
# another example using the step option
B = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
print(B[::2])

In [None]:
# how about this?
print(B[::3])

### Exercise 
- Slice the given array A on both its rows and columns to extract 
```
[[40, 50],   
 [70, 80]] 
```

In [None]:
# creating the array again
A = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90]).reshape(3,3)
print(A)

In [None]:
# write the slicing logic to get the result in the question
A[___, ____]

## Part 2: Pandas

### Overview of Pandas

- Pandas is a Python library for performing data analysis, built upon the NumPy library
- Better than NumPy for tabular data because a DataFrame can hold different data types and supports label indexing on top of positional indexing
- Main objects
    - Series (1 column)
    - DataFrame (horizontally stacked collection of Series)
    
- Allows vectorized operations and broadcasting under the hood
- Flexible in manipulating and generating new columns/rows, renaming indexes, and pivoting from rows to columns and vice versa
- Index alignment (which depends on having row/column labels) saves the manual row/column alignment that NumPy requires

**Pandas data structures**
- **Series** (data, index, name)
- **DataFrame** (many Series horizontally joined together, each name of Series becoming column name)
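
To make the two structures concrete, here is a minimal sketch with invented account data. It builds two Series and places them side by side with `pd.concat(..., axis=1)`, which is one of several ways to form a DataFrame.

```python
import pandas as pd

# A Series has three parts: data, index (row labels) and name
balance = pd.Series([100, 250], index=["acc1", "acc2"], name="balance")
branch = pd.Series(["East", "West"], index=["acc1", "acc2"], name="branch")

# Placing Series side by side gives a DataFrame;
# each Series name becomes a column name
accounts = pd.concat([balance, branch], axis=1)
print(accounts)
print(list(accounts.columns))    # ['balance', 'branch']
```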

In [None]:
# to ensure that all of us have the same pandas version, run the following.
# this code will install a specific pandas version. You don't have to run this ever again to use pandas.
!pip install pandas==1.3.5

<a id='series'></a>
## Series
A Series is a one-dimensional object similar to an array, list or column of a table.

In [None]:
import pandas as pd

In [None]:
# Series
deposits = pd.Series([4502, 234, 9582, 7], name='Balances')
deposits

### Getting details about the series 

In [None]:
# print out some attributes of the pandas series
print(deposits.values) # numpy array under the hood
print(deposits.index) 
print(deposits.name)

In [None]:
# If we want, we can relabel the index
customer_names = ["Arya","Sansa","Robb","Bran"]
deposits.______ = customer_names
deposits

### Reading data from the Series
- **Position** (**end-exclusive**) or **Label** (**end-inclusive**)
    - **single** `deposits[1]`, `deposits['Arya']`
    - **slice**  `deposits[1:3]`, `deposits['Arya':'Robb']`
        - for consecutive ranges
        - you don't know exactly which rows but know endpoints
    - **list** `deposits[[1,3,2]]`,`deposits[['Sansa','Bran','Robb']]`
        - for non-consecutive
        - you know exactly what you want
        - you want to select in random order
- labels are more intuitive and common way to select, who can remember position!?
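
The end-exclusive versus end-inclusive rule can be checked with a small sketch. It uses a made-up Series `s` and the explicit `.iloc` (position) and `.loc` (label) accessors, which leave no doubt about which kind of selection is meant. As far as I know, recent pandas 2.x releases warn when plain `[]` is given integer positions on a label-indexed Series, so the explicit accessors are the safer habit.

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8], index=["w", "x", "y", "z"])

# Position slice with .iloc: the stop position is excluded
print(s.iloc[1:3].tolist())        # [6, 7]

# Label slice with .loc: the stop label is included
print(s.loc["x":"z"].tolist())     # [6, 7, 8]

# List of positions or labels, in any order
print(s.iloc[[3, 0]].tolist())     # [8, 5]
print(s.loc[["z", "w"]].tolist())  # [8, 5]
```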

In [None]:
deposits

#### Select by single position 

In [None]:
deposits[2]

#### Select by single label

In [None]:
deposits["Sansa"]

#### Select by slice of positions 

In [None]:
# We can select a range of elements (end index is exclusive)
deposits[0:2]

#### Select by slice of labels 

In [None]:
# We can select a range of index names (end index is inclusive)
deposits['____':'_____']

#### Select by list of positions 

In [None]:
deposits[[0,2]]

#### Select by list of labels 

In [None]:
# use a list to select specific labels
customers_to_check = ["Robb","Bran"] 
deposits[customers_to_check]  # no need [[]] since using list variable already

### Assigning new values 
- Appearing on right side of assignment operator means selection
    - `third_customer_balance = deposits[2]`
- Appearing on left side of assignment operator means update/insert 
    - `deposits[2] = 0`

#### Individual values 

In [None]:
# assigning new values to an element
# try to assign 0 to Arya label
deposits['____'] = __
deposits

#### Range of values 

In [None]:
deposits['Arya':'Robb'] = 0 # side note: you can pick specific people by doing this too - deposits[['Sansa','Robb']]

deposits

In [None]:
# creating a new index
deposits['NewCustomer'] = 100
deposits

### Boolean indexing
- Most common way to select/update
- Used when slice endpoints are unknown, and what's read/updated depends on contained values

In [None]:
deposits = pd.Series([4502, 234, 9582, 7], name='Balances')

print(deposits > 2500) # you can create a boolean series for filtering easily

In [None]:
# try this

mask = deposits > 2500
print(deposits[mask])

### Exercise 

- Use boolean indexing to filter bank products that have exactly 7 letters
- Hint use `series.str.len()`: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.len.html 


In [None]:
# you are given a pandas Series with 5 values
bank_products = pd.Series(['deposit', 'lending', 'creditcard', 'installment', 'advance'],name='Bank Products')
bank_products

In [None]:
# clue 1: try to run the code below
bank_products.str.len()

# how can you use this result to complete the exercise?

In [None]:
# write your code below



### Multiple Conditions using logical and, or, not
- and is now `cond1 & cond2`
- or is now `cond1 | cond2`
- not is now `~cond`
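
A minimal sketch of combining conditions, using invented temperature data. It assumes the comparisons are written inline, in which case each one needs its own parentheses.

```python
import pandas as pd

temps = pd.Series([12, 25, 31, 18], index=["Mon", "Tue", "Wed", "Thu"])

# Wrap each comparison in parentheses: & and | bind tighter than > and <
mild = (temps > 15) & (temps < 30)
print(temps[mild].tolist())     # [25, 18]

# ~ flips every True/False value in the mask
print(temps[~mild].tolist())    # [12, 31]

# Python's own `and` cannot combine two Series and raises ValueError
try:
    (temps > 15) and (temps < 30)
except ValueError as err:
    print("ValueError:", err)
```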

In [None]:
deposits = pd.Series([4502, 2655, 9582, 7], name='Balances')
deposits

In [None]:
above_2500 = deposits > 2500
below_3000 = deposits < 3000

# Code below
# Can you filter the deposits rows that are between 2500 and 3000?
inside_range = ______ & _______
print(inside_range)
print(deposits[inside_range])

In [None]:
# outside the range means NOT above 2500 OR NOT below 3000
outside_range = ~above_2500 | ~below_3000
print(outside_range)

print(deposits[outside_range])

### Exercise 
- Select rows with values 2,3,4,5 from the series to practice boolean indexing
- Stretch: Once you can do it with boolean indexing, get creative and practice label/positional indexing too to solve the above

In [None]:
s = pd.Series(index=['a','b','c','d','e','f','g','h'],
              data=[1,2,3,4,5,6,7,8])
print(s)
# your code below

In [None]:
# Answer


In [None]:
# Stretch


### Basic Operations
***Mathematical operations and functions***
- Broadcasting from numpy happening here (**We have seen this before in numpy!**)

In [None]:
deposits

In [None]:
# we can use various operations on series, 
# used to generate new columns of information during feature engineering in machine learning
print(deposits / 100)  # pandas uses numpy, and numpy broadcasts here: 100 is treated as 100,100,100,100
print(deposits * 2)


In [None]:
# getting some statistics, for more of what you can do, dot-tab, or dir(), 
# or look at left side panel of https://pandas.pydata.org/docs/reference/api/pandas.Series.html

print(deposits.mean()) # average
print(deposits.std())  # standard deviation
print(deposits.var())  # variance

### Common analytics tools
- unique()
- value_counts() --> like unique, but also shows how many times each unique value occurs
    - Usually chained with sort_index/values
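
As a rough illustration of the two tools on invented data (a Series of colour names):

```python
import pandas as pd

colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# unique() returns the distinct values in order of first appearance
print(list(colours.unique()))       # ['red', 'blue', 'green']

# value_counts() returns a Series: index = value, data = count,
# sorted by count in descending order
counts = colours.value_counts()
print(counts)

# Chain sort_index() to order by the value itself instead of by count
print(counts.sort_index().index.tolist())    # ['blue', 'green', 'red']
```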

In [None]:
s = pd.Series([10, 20, 30, 0, 10, 20, 0],
              index=['a','b','c','d','e','f','g']) 

In [None]:
print(s)

In [None]:
# Return all unique values as numpy array, note the change in type, meaning you can no longer chain series methods, but only numpy array methods


In [None]:
# Count the frequency of each unique value --> Used to plot histograms 
print(s.value_counts())  # default sorts values descending
print(s.value_counts().sort_index())
print(s.value_counts().sort_values())

<a id='dataframe'></a>
## DataFrame

- Tabular data structure very similar to spreadsheet or relational database table
- Series in multiple dimensions
- Essentially an ordered collection of columns, each of which can contain values of different types

In [None]:
# common way to manually create a pandas dataframe
# is through a dictionary with the key as column name and the value as a list of values
# the list of values can be any 1D sequence (eg. ndarray, pd.Series, tuple,) usually created programmatically

data = {"cust_id": [1,2,3,4,5,6],
       "cust_name": ["Carl", "Aladdin", "Simba", "Rapunzel", "Cinderella", "Coco"],
       "savings": [25.99,31.99,20.99,25.99,20.99,26.99]}

df = pd.DataFrame(data)
df

In [None]:
#data structure
print(type(df))

In [None]:
# look at the first 5 items of your dataframe, put a number in head(n)/tail(n) to see top/last n rows


In [None]:
# last 5 items of df


### Selecting column(s) 

In [None]:
# Retrieve the column 'cust_name'
df['cust_name']


# OR df.cust_name also works (dot notation)

In [None]:
# Retrieve the column 'cust_name' and 'savings', note the list used again, but now along columns (2nd dimension)
df[['cust_name','savings']]   

#### Boolean Indexing 

In [None]:
# Evaluate if savings is > $23
df['savings'] ______

In [None]:
# If customer name length is less than 6 characters
df['cust_name'].str.len() < 6

In [None]:
# use variables for better code
saving_over_23 = df['savings']>23
short_cust_name = df['cust_name'].str.len() < 6

# Enter the boolean series in the blank
df[_________ & _________]  # different from series, now each condition can be generated from different columns


### Exercise  
- Find customers where the name is one of `['Simba','Coco']` (Stretch: use `series.isin(['Simba','Coco'])`) and savings is less than 30

In [None]:
data = {"cust_id": [1,2,3,4,5,6],
       "cust_name": ["Carl", "Aladdin", "Simba", "Rapunzel", "Cinderella", "Coco"],
       "savings": [25.99,31.99,20.99,25.99,20.99,26.99]}

df = pd.DataFrame(data)
df

In [None]:
# your code 


### Some attributes of a DataFrame

In [None]:
# VERY USEFUL!
df.info() # more information than df.dtypes, used to find number of missing values under "Non-Null Count"

In [None]:
# check number of rows and columns
df.shape # used for checking you got the correct number of rows, columns after some filtering/generation operation

### Indexing 

- use the __iloc__ (**end-exclusive**) attribute with the required positional index 
- use the __loc__ (**end-inclusive**) attribute with the required label index
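
A compact sketch of both accessors on a hypothetical DataFrame `pets`, including the optional second indexer for columns:

```python
import pandas as pd

pets = pd.DataFrame({"age": [3, 5, 2, 7],
                     "weight": [4.0, 30.5, 1.2, 8.8]},
                    index=["cat", "dog", "hamster", "fox"])

# iloc: positions, stop excluded -> rows at positions 0 and 1
print(pets.iloc[0:2].index.tolist())             # ['cat', 'dog']

# loc: labels, stop included -> rows 'cat' through 'hamster'
print(pets.loc["cat":"hamster"].index.tolist())  # ['cat', 'dog', 'hamster']

# Both accept a second indexer for columns
print(pets.iloc[1, 0])          # 5 (row position 1, column position 0)
print(pets.loc["dog", "age"])   # 5 (row label 'dog', column label 'age')
```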

In [None]:
data = {"cust_id": [1,2,3,4,5,6],
       "cust_name": ["Carl", "Aladdin", "Simba", "Rapunzel", "Cinderella", "Coco"],
       "savings": [25.99,31.99,20.99,25.99,20.99,26.99]}

df = pd.DataFrame(data)
df = df.set_index('cust_name') # moving a column into index to demonstrate position vs label indexing

df

In [None]:
# using iloc - integer location (position-based)
# print the example and study the output
print(df.iloc[2])  # a row series instead of a column, the 3rd row because python is 0-indexed

In [None]:
# using iloc - integer location (position-based)
# print the example and study the output
print(df.iloc[2:5]) # slice, iloc is end-exclusive so only get rows 2,3,4 excluding 5

In [None]:
# using iloc - integer location (position-based)
# print the example and study the output
print(df.iloc[[2,3,4]]) # specifying the parts of the slice 1 by 1, used for debugging usually

In [None]:
# using loc - label locate
# print the example and study the output
print(df.loc['Simba'])# label based

In [None]:
# using loc - label locate
# print the example and study the output
print(df.loc['Simba':'Cinderella']) # slice based

In [None]:
# using loc - label locate
# print the example and study the output
row_labels = ['Simba','Rapunzel','Cinderella']
df.loc[row_labels]

In [None]:
# You try! How to get a result with only Carl and Coco? 
# You can do it with iloc or loc


### Updating

#### Changing rows for a particular column

- df.loc[row_indexer, column_indexer]
- while working on multiple columns is possible, we usually work on 1 column at a time as physically each column means something different and should be handled differently

In [None]:
df

In [None]:
df.loc['Carl',"savings"] = 0
df

#### Changing columns 
- Feature engineering in machine learning requires generating new columns and updating existing ones
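
A minimal sketch of both operations, assuming a made-up `orders` table with `price` and `quantity` columns:

```python
import pandas as pd

orders = pd.DataFrame({"price": [2.5, 4.0, 1.0],
                       "quantity": [4, 1, 10]})

# New column computed element-wise from two existing columns
orders["total"] = orders["price"] * orders["quantity"]

# Overwrite an existing column (here: a 10% price increase)
orders["price"] = orders["price"] * 1.1

print(orders)
```

Assigning to a column name that does not exist yet creates the column, while assigning to an existing name replaces its values.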

##### Update existing column 

In [None]:
# you can generate 1 col from more than 1 col


In [None]:
# you can assign multiple columns as long as num of cols on both sides of = match


### Exercise
- Update the savings column with less than 25 dollars to 25 dollars 
- Remember syntax is `df.loc[boolean_series_created_from_comparison, column_name_string]`

In [None]:
data = {"cust_id": [1,2,3,4,5,6],
       "cust_name": ["Carl", "Aladdin", "Simba", "Rapunzel", "Cinderella", "Coco"],
       "savings": [25.99,31.99,20.99,25.99,20.99,26.99]}

df = pd.DataFrame(data)
df = df.set_index('cust_name') # moving a column into index to demonstrate position vs label indexing

df

In [None]:
# Your code below: Fill in the blanks
# Clue: If you need to update more than one value at once, you need to FILTER for those relevant rows
df.loc[______, ______] = ____

### Axis = 0 or 1? 

***Statistic functions can be applied on the DataFrame***
- List of dataframe methods: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

In [None]:
# recreate the dataframe
data = {"cust_id": [1,2,3,4,5,6],
       "savings": [25.99,31.99,20.99,25.99,20.99,26.99]}

df = pd.DataFrame(data)
df

In [None]:
# study this output. why do you get this result?
print(df.sum()) # default is axis=0, down the columns, each result is the sum down 1 column

In [None]:
# study this output
print(df.sum(axis = 1))   # more rarely, we may want to sum across rows. 

In [None]:
# more functions
print(df.____())
print(df.____())

In [None]:
# VERY USEFUL!!


### Importing Data into DataFrame
- More examples of what formats pandas can read from/write to, Excel is common use-case: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

- we can read data file in various format into a DataFrame object
- CSV files can be read with ***read_csv()*** function
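
To try `read_csv` without a file on disk, this sketch feeds it CSV text held in memory through `io.StringIO`. The two rows are hypothetical and are not taken from `movies.csv`.

```python
import io
import pandas as pd

# Stand-in for a file on disk: CSV text held in memory (hypothetical data)
csv_text = """movieId,title
1,Alpha
2,Beta
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)     # (2, 2)
print(sample.dtypes)    # column types are inferred from the text
```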

In [None]:
#import movie metadata
movies = pd.read_csv("assets/movies.csv")

In [None]:
movies.info()

In [None]:
#check our dataset


## Joins in Pandas

How to join 2 DataFrames to get the data we need to solve our questions.

We have the movies DataFrame, which contains each movie's title.

And we have the ratings DataFrame, which contains each movie's rating.

In [None]:
ratings = pd.read_csv("assets/ratings.csv")

In [None]:
ratings.info()

In [None]:
# check your second dataset


## How do we get a combined table that has movieId, movie title and movie rating?

### Join using pd.merge

### Merge

- The merge function of pandas is equivalent to database-style joins
- There are 4 common joins

<img src="img/sql-joins.png" style="margin: 20px; height: 350px">
<img src="img/join.png" style="height: 500px">
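
The four join types can be compared on two tiny tables before running them on the real data. This is a sketch with invented frames `left` and `right` that share a `key` column.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Print which keys survive each type of join
for how in ["inner", "left", "right", "outer"]:
    joined = left.merge(right, on="key", how=how)
    print(how, sorted(joined["key"].tolist()))
# inner [2, 3]
# left [1, 2, 3]
# right [2, 3, 4]
# outer [1, 2, 3, 4]
```

Keys that exist on only one side get NaN in the columns that come from the other table.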

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
movies.merge(ratings, how='inner')

In [None]:
# how many rows do you see? why? 
# which values are NaN? Why?
movies.merge(ratings, how='left')

In [None]:
# how many rows do you see? why? 
# which values are NaN? Why?
movies.merge(ratings, how='right')

In [None]:
# how many rows do you see? why? 
# which values are NaN? Why?
movies.merge(ratings, how='outer')

### Mismatching column names 
- requires specifying `left_on`, `right_on`
- Defines the 1(or more) set of columns from each side whose values are compared to find matches
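
A sketch of a join where the key really is named differently on each side. The `films` and `scores` tables and their column names are invented for illustration.

```python
import pandas as pd

films = pd.DataFrame({"id": [1, 2], "title": ["Alpha", "Beta"]})
scores = pd.DataFrame({"film_id": [1, 1, 2], "score": [4.0, 5.0, 3.0]})

# The key is called "id" on the left and "film_id" on the right
joined = films.merge(scores, left_on="id", right_on="film_id")
print(joined)

# Both key columns are kept; drop one if it is redundant
joined = joined.drop(columns="film_id")
print(joined.columns.tolist())    # ['id', 'title', 'score']
```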


In [None]:
movies.merge(ratings,left_on='movieId', right_on='movieId')

### Summarizing JOINS
- What information do I need and which columns in which tables are they in?
- Now that I know the tables, what are the column(s) from each table that let me join them?
- Which type of join do I want?
    - Depends whether you want to not lose data from left table (**left join**), right table (**right join**), or both tables (**outer join**)
    - Depends on whether you don't want NaNs to be generated (**inner join**)
    - Depends on what's the unit of analysis
        - Usually customers table on left, purchases on right
        - customers left join purchases
        - Customers who never bought anything do not appear in purchases. A left join keeps these 0-purchase customers in the result (inner and right joins drop them), so a later groupby analysis can count them as having bought 0 instead of missing them entirely
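
The customers and purchases case can be sketched as follows. The tables are invented, and `groupby` is used only to show the final count; it is not covered in this lesson.

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3]})
purchases = pd.DataFrame({"cust_id": [1, 1, 3],
                          "amount": [20.0, 5.0, 12.5]})

# Left join keeps customer 2, with NaN in the purchase column
joined = customers.merge(purchases, on="cust_id", how="left")

# count() ignores NaN, so customers without purchases get 0
per_customer = joined.groupby("cust_id")["amount"].count()
print(per_customer.to_dict())    # {1: 2, 2: 0, 3: 1}
```

With an inner join, customer 2 would be missing from the result rather than counted as 0.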

## Concat

The ***concat()*** function does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes. 

- For example, as in the figure below, you can stack df1, df2, and df3 to form the DataFrame `result`

<img src="img/concat.png" style="margin: 20px; height: 400px">
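
The "set logic on the other axes" part is easier to see with frames whose columns differ. A minimal sketch with two invented one-row tables:

```python
import pandas as pd

north = pd.DataFrame({"city": ["Oslo"], "pop_m": [0.7]})
south = pd.DataFrame({"city": ["Rome"], "area": [1285]})

# axis=0 (the default) stacks rows; join="outer" (the default)
# keeps the union of columns and fills the gaps with NaN
stacked = pd.concat([north, south], ignore_index=True)
print(stacked)

# join="inner" keeps only the columns both frames share
shared = pd.concat([north, south], join="inner", ignore_index=True)
print(shared.columns.tolist())     # ['city']
```

Here `ignore_index=True` renumbers the rows 0, 1, ... instead of keeping each frame's original index.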

In [None]:
#Exercise
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},index=[0, 1, 2])

df2 = pd.DataFrame({'A': ['E2', 'E3', 'E6'],
                    'B': ['F2', 'F3', 'F6'],
                    'C': ['G2', 'G3', 'G6']},index=[3, 4, 5])

## Exercises 

### Joins 
- Try to join `fraud_customer.csv` and `fraud_transaction.csv` together (both CSV files are inside the assets folder, at the same level as this Jupyter notebook)
    - use df.merge

In [None]:
customer = pd.read_csv('assets/fraud_customer.csv')
customer.head()

In [None]:
transaction = pd.read_csv('assets/fraud_transaction.csv')
transaction.head()

In [None]:
# Write the code to join the two datasets together
# Take note of the column names!!







## Lesson Summary


Let's review what we learned today. We:

- explored numpy arrays to use for basic operations and matrices
- used pandas dataframe and series to manipulate data
- understood and used different types of joins, and explored data models and their application to the movies dataset

- MUST USE CHEATSHEETS (setting one as your desktop background is a handy trick)

|         Select by Label         |      Explicit Syntax      | Shorthand Convention |
|:-------------------------------:|:-------------------------:|:--------------------:|
| Single column from dataframe    | df.loc[:,"col1"]          | df["col1"]           |
| List of columns from dataframe  | df.loc[:,["col1","col7"]] | df[["col1","col7"]]  |
| Slice of columns from dataframe | df.loc[:,"col1":"col4"]   |                      |
| Single row from dataframe       | df.loc["row4"]            |                      |
| List of rows from dataframe     | df.loc[["row1", "row8"]]  |                      |
| Slice of rows from dataframe    | df.loc["row3":"row5"]     | df["row3":"row5"]    |
| Single item from series         | s.loc["item8"]            | s["item8"]           |
| List of items from series       | s.loc[["item1","item7"]]  | s[["item1","item7"]] |
| Slice of items from series      | s.loc["item2":"item4"]    | s["item2":"item4"]   |

# Exercises
- Pandas exercises https://github.com/guipsamora/pandas_exercises
    - For those who never used git, just add `zipball/master/` to end of the url of any repository to download whole repository as 1 zip
    - Eg. https://github.com/guipsamora/pandas_exercises/zipball/master

# Readings

- Numpy 
    - use cases: https://arxiv.org/pdf/1102.1523.pdf
    - Broadcasting:
        - https://blog.finxter.com/numpy-broadcasting-a-simple-tutorial/
        - https://medium.com/@souravdey/l2-distance-matrix-vectorization-trick-26aa3247ac6c (application to KNN algorithm)
    - Reshaping (numpy): https://towardsdatascience.com/reshaping-numpy-arrays-in-python-a-step-by-step-pictorial-tutorial-aed5f471cf0b


- Pandas
    - Official cookbook: https://pandas.pydata.org/pandas-docs/version/0.17.0/cookbook.html#cookbook
        - Cookbooks are the next-level resource once you are familiar with the syntax; they expose you to common problems and the patterns for solving them with a tool
    - Different sorts of reshaping in pandas: https://pandas.pydata.org/docs/user_guide/reshaping.html
    - Cheatsheets: https://blog.finxter.com/pandas-cheat-sheets/
    - JOINS with multiple matching rows on right table for each row on left (Venn diagram cannot express when multiple rows in right table match single row in left table): https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join#answer-27458534
    - Official Docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
    
- Quirks with pandas & and | 
    - `&` and `|` behave differently in pandas than in plain Python because pandas applies them element-wise to 1D operands
        - This involves a general programming concept called **operator overloading**: the same operator does different things depending on the data type of its operands. Every library may have its own quirks, so this knowledge is not transferable and has to be memorized
        - You saw this in 1+1, 'good'+'morning', [1,2]+[3]
    - https://stackoverflow.com/questions/39388950/logical-or-bitwise-or-in-pandas-data-frame
    - https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas