### Convolution clarification
Input: Image (I) and Kernel (K)
Output: Convolution (C)

`C[i,j]=sum(I~[i:i+3,j:j+3]*K)`

The 3 is the kernel size: here K is a 3x3 matrix, so each output value sums over a 3x3 window of the image.
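A minimal NumPy sketch of this sliding-window sum. It assumes that `I~` denotes the image padded with a one-pixel border of zeros (the notes do not say how the border is handled), and the names `conv3x3`, `image`, and `box` are made up for illustration. As in the formula, the kernel is not flipped, so strictly speaking this computes a cross-correlation; for symmetric kernels the two coincide.

```python
import numpy as np

def conv3x3(image, kernel):
    """Slide a 3x3 kernel over a zero-padded 2-D image (kernel is not flipped)."""
    # Assumption: I~ is the image with a 1-pixel border of zeros.
    padded = np.pad(image, 1, mode="constant")
    rows, cols = image.shape
    out = np.zeros((rows, cols), dtype=float)
    for i in range(rows):
        for j in range(cols):
            # C[i, j] = sum of the element-wise product of the window and K
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
box = np.ones((3, 3)) / 9.0   # averaging kernel, chosen only for illustration
smoothed = conv3x3(image, box)
```

With zero padding the output has the same shape as the input; values near the border are pulled toward zero because part of each window lies in the padding.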


`z=f(x,y)`

z stands for the intensity of the color, (x,y) is the position within the image, and f is the function that maps each position to its intensity

grey: f: R^2 --> R (one intensity value per position)
color: f: R^2 --> R^3 (one value per color channel)

`(Gx * f)(x,y) ~= fx(x,y)`

Convolving f with the kernel Gx approximates fx, the partial derivative of f with respect to x.
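A small numerical check of this approximation. The notes do not define `Gx`, so this sketch assumes it is the central-difference kernel `[-1/2, 0, 1/2]` acting along x; the test function `x^2 + 3y` and all variable names are chosen only for illustration. The kernel is applied as a sliding window without flipping, as in the `C[i,j]` formula.

```python
import numpy as np

x = np.arange(6, dtype=float)
y = np.arange(5, dtype=float)
X, Y = np.meshgrid(x, y)      # X varies along columns, Y along rows
F = X**2 + 3.0 * Y            # samples of f(x, y) = x^2 + 3y, so f_x = 2x

# Assumed kernel: central difference along x, (f(x+1, y) - f(x-1, y)) / 2
kernel_x = np.array([-0.5, 0.0, 0.5])

# Apply the kernel at every interior column (border columns are skipped)
approx_fx = np.stack(
    [(F[:, j - 1:j + 2] * kernel_x).sum(axis=1) for j in range(1, F.shape[1] - 1)],
    axis=1,
)
exact_fx = 2.0 * X[:, 1:-1]
max_error = np.abs(approx_fx - exact_fx).max()
```

For this quadratic test function the central difference reproduces the derivative exactly; for a general image it is only an approximation.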

# Lecture 9 Introduction to Pandas

[Pandas--*Python Data Analysis Library*](https://pandas.pydata.org/) provides high-performance, easy-to-use data structures and data analysis tools for Python, and is very useful in data science. In these lectures, we focus only on the [elementary usage](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).

In [1]:
import pandas as pd
import numpy as np

In [2]:
pip install pandas --upgrade

Requirement already up-to-date: pandas in c:\users\nhing\anaconda3\lib\site-packages (1.2.4)
Note: you may need to restart the kernel to use updated packages.


In [2]:
pd.__version__

'1.2.4'

In [3]:
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Flags',
 'Float32Dtype',
 'Float64Dtype',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p17',
 '_np_version_under1p18',
 '_testing'

## Important Concepts: `Series` and `DataFrame`

In short, a `Series` represents one variable (attribute) of a dataset, while a `DataFrame` represents the whole table (it also supports multi-index or tensor cases -- we will not discuss these cases here).

Tabular data: each row is an observation, and each column is a variable.

A `Series` represents one attribute / variable (one column); several `Series` placed side by side make up a `DataFrame`.

Tabular data can be stored by rows or by columns; a `DataFrame` keeps its data by columns, with each column being a `Series`. Column storage is more memory-efficient: all values in a column belong to the same variable, so they share one data type and a fixed element size, whereas a row mixes different variables whose types and sizes differ.
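The point about columns having a fixed type can be seen directly. The sketch below uses a made-up table (`people` and its column names are hypothetical) and compares the per-column dtypes with the dtype of a single row.

```python
import pandas as pd

# Hypothetical table: three variables with three different types
people = pd.DataFrame({
    "age": [31, 45, 27],
    "height_m": [1.72, 1.80, 1.65],
    "enrolled": [True, False, True],
})

column_dtypes = people.dtypes   # one dtype per column
first_row = people.iloc[0]      # a single row, returned as a Series
row_dtype = first_row.dtype     # the row mixes int, float and bool values
```

Each column keeps its own numeric or boolean dtype, while the row has to fall back to the generic `object` dtype because its values are of different types.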

A `Series` is similar to a 1-D NumPy array, with an additional "index" that labels each sample; in this respect it also resembles Python's built-in dictionary type.

In [4]:
s1 = pd.Series([2, 4, 6]) # initialize something from a list
s1

0    2
1    4
2    6
dtype: int64

In [5]:
type(s1)

pandas.core.series.Series

In [6]:
s1.index # similar to array; this shows the index (starts at 0, ends at but not including 3, step=1)

RangeIndex(start=0, stop=3, step=1)

In [15]:
s2 = pd.Series([2, 4, 6], index=['a', 'b', 'c']) # you can assign labels (an index) to the values

In [16]:
s2

a    2
b    4
c    6
dtype: int64

In [9]:
s2_num = s2.values # convert to a NumPy array; when all elements are numbers this can be a view of s2's data rather than a copy
s2_num # internally, pandas stores the values of s2 in a NumPy array

array([2, 4, 6])

In [10]:
np.shares_memory(s2_num,s2) # check if they share a memory

True

In [11]:
s2_num_copy = s2.to_numpy(copy=True) # another way to convert to NumPy; copy=True always creates a copy, copy=False returns a view when possible
# recommended in newer versions of pandas, since you can state whether you need a copy
np.shares_memory(s2_num_copy, s2) # it's a copy, so they don't share memory

False

Selection by position -- similar to Numpy array!

In [12]:
s2[0:2] # first 2 elements of s2; 0 and 1 index, 2 isn't a part of it

a    2
b    4
dtype: int64

Selection by index (label)

In [14]:
s2['a'] # picks up a single row

2

In [13]:
s2[['a','c']] # picks up 2 rows; don't have to be consecutive

a    2
c    6
dtype: int64
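Plain square brackets accept both positions and labels, which can be ambiguous. pandas also provides the explicit indexers `.iloc` (by position) and `.loc` (by label); the sketch below rebuilds a Series like `s2` so that it runs on its own.

```python
import pandas as pd

s = pd.Series([2, 4, 6], index=["a", "b", "c"])

by_position = s.iloc[0:2]     # positions 0 and 1; the stop position is excluded
by_label = s.loc["a":"c"]     # labels 'a' through 'c'; the stop label is included
picked = s.loc[["a", "c"]]    # an explicit list of labels, need not be consecutive
```

Note the asymmetry: a position slice excludes its endpoint, as in NumPy, while a label slice includes it.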

`Series` and Python Dictionary

In [49]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135} # this is the built-in python dictionary
# each key is separated from its value by a colon, and the key-value pairs are separated by commas
# 'California' and the other state names are the keys
population = pd.Series(population_dict) # initialize Series with dictionary
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [16]:
population_dict['Texas'] # look up the value stored under a key in the Python dictionary

26448193

In [18]:
population['Texas'] # same lookup, but on the pandas Series

26448193
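Beyond key lookup, a `Series` also behaves like an array, which a plain dictionary does not. A short sketch using a two-state subset of the population data (the variable names are chosen for illustration):

```python
import pandas as pd

population = pd.Series({"California": 38332521,
                        "Texas": 26448193,
                        "Florida": 19552860})

has_texas = "Texas" in population               # dict-style membership test on the labels
in_millions = population / 1e6                  # array-style arithmetic on all values at once
large = population[population > 20_000_000]     # boolean mask; matching labels are kept
```

The same division applied to `population_dict` would raise a `TypeError`, since dictionaries do not support arithmetic.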

In [50]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Create a pandas `DataFrame` from `Series`. Note that in pandas, the rows and columns of a `DataFrame` are referred to as `index` and `columns`, respectively.

In [51]:
states = pd.DataFrame({'Population': population,
                       # population is a column; columns are one-dimension, but data frames are 2 dimensional
                       'Area': area}) # variable names
states

| | Population | Area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |
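When a `DataFrame` is built from several `Series`, the rows are matched by index label rather than by position. The sketch below uses two deliberately mismatched Series (a made-up subset of the data above) to show what happens when a label is missing from one of them.

```python
import pandas as pd

population = pd.Series({"California": 38332521, "Texas": 26448193})
area = pd.Series({"Texas": 695662, "Florida": 170312})

# The resulting index is the union of both indexes;
# entries with no data in a column are filled with NaN.
partial = pd.DataFrame({"Population": population, "Area": area})
missing_population = partial["Population"].isna()
```

Because NaN is a floating-point value, columns that contain missing entries are stored as floats rather than integers.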


In [21]:
type(states)

pandas.core.frame.DataFrame

In [22]:
states.index # names (row)

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [23]:
states.columns # variable names (column)

Index(['Population', 'Area'], dtype='object')

In [25]:
states['Area'] # column names are case-sensitive: 'Area' and 'area' are different
# this returns a Series, which is different from the DataFrame shown above

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: Area, dtype: int64

In [26]:
states.Area # columns can also be accessed as attributes, provided the name is a valid Python identifier and does not clash with a DataFrame method

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: Area, dtype: int64

In [27]:
type(states['Area'])

pandas.core.series.Series
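Whether a selection returns a `Series` or a `DataFrame` depends on what is passed inside the brackets. A minimal sketch, rebuilding a two-row version of `states` so that it runs on its own:

```python
import pandas as pd

states = pd.DataFrame(
    {"Population": [38332521, 26448193], "Area": [423967, 695662]},
    index=["California", "Texas"],
)

as_series = states["Area"]     # a single label gives a Series (1-D)
as_frame = states[["Area"]]    # a list of labels gives a DataFrame (2-D), even for one column
```

The double brackets are not special syntax: the inner pair is an ordinary Python list of column names.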

In [28]:
random = pd.DataFrame(np.random.rand(3, 2),columns=['foo', 'bar'],index=['a', 'b', 'c'])
# you can create a data frame from a numpy array; very basic and flexible
random

| | foo | bar |
|---|---|---|
| a | 0.654325 | 0.030998 |
| b | 0.42391 | 0.85679 |
| c | 0.058505 | 0.190484 |


In [29]:
random.T # this is the transpose of the data above

| | a | b | c |
|---|---|---|---|
| foo | 0.654325 | 0.42391 | 0.058505 |
| bar | 0.030998 | 0.85679 | 0.190484 |
