`What is NumPy?`


NumPy stands for Numerical Python. It is an open-source numerical library that provides multidimensional array and matrix data structures, along with mathematical functions for working with them.

It's a very common library used in data science and machine learning.


`Why do we use NumPy?`

A normal Python list is actually a group of pointers to separate Python objects, for example the individual Python number objects inside the list.

A NumPy array is designed to hold values of one uniform type, so it needs no extra memory for per-element pointers and type information.

This makes it much more efficient, and it uses far less memory than a normal Python list.

NumPy can also read and process data faster than loops in plain Python, and it offers many convenient broadcasting operations that can be applied across array dimensions.
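
The memory claim above can be checked with a rough sketch. It assumes CPython, where `sys.getsizeof` reports the size of the list's pointer table and of each integer object separately; the exact byte counts vary between Python versions, so only the comparison matters here.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list holds pointers, and every integer is a separate Python object,
# so we add the size of the list itself to the size of each element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array stores the raw 8-byte values side by side.
array_bytes = np_arr.nbytes

print("list :", list_bytes, "bytes (approximate)")
print("array:", array_bytes, "bytes")
```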

`Operations`

**NumPy arrays and matrices**

In [None]:
import numpy as np

In [None]:
mylist = [1,2,3]
np.array(mylist)

nested_lst = [[1,2],[6,3], [3,4]]
np.array(nested_lst) # 2D matrix with 3 rows and 2 columns
# the number of opening brackets shows the number of dimensions

# built-in ways to generate arrays
np.arange(0,10) # 0 up to and not including 10
np.arange(0,10,2) # optional step size; a quick way to get even numbers

# Often in DS/ML we want to quickly create arrays of just 0s or 1s
np.zeros(3)
np.zeros((4,4))
np.ones(3)
np.ones((3,10))

np.linspace(0,10,3)
np.linspace(0,10,21) # 21 evenly spaced numbers from 0 to 10, both ends included
np.eye(5) # 1s along the diagonal; by definition an identity matrix is always square

# random functionality
# np.random offers a bunch of different random function calls
# and many types of random distributions we can draw from
# (in a notebook, typing np.random. and pressing Tab lists them)
# 2 random numbers within a uniform distribution between 0 and 1
np.random.rand(2)
np.random.rand(3,4) # for matrix
# from standard normal distribution (centered/mean 0 variance 1)
np.random.randn(5,5)
# simulating people's ages
np.random.randint(1,100)
np.random.randint(1,100,10)
np.random.randint(1,100,(2,3))
# the seed is an arbitrary number; it matters for reproducibility
# the seed and the random call must run in the same cell
np.random.seed(101)
np.random.rand(4)

arr = np.arange(25)
arr.shape # (25,): just one dimension
arr.reshape(5,5) # fails if the new shape doesn't hold exactly 25 elements
ranarr = np.random.randint(0,50,10)
ranarr.max()
ranarr.argmax() # index location of the max value
ranarr.min()
ranarr.argmin() # index location of the min value
ranarr.dtype # grab the data type of the array
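
The notes on seeding and reshaping in the cell above can be sketched in a small standalone example. The seed value 101 is the one used above; the variable names are only illustrative.

```python
import numpy as np

# Re-seeding with the same number replays the same "random" values.
np.random.seed(101)
first = np.random.rand(4)
np.random.seed(101)
second = np.random.rand(4)
same_numbers = np.array_equal(first, second)

# reshape only works when the element counts match: 25 == 5 * 5.
arr = np.arange(25)
grid = arr.reshape(5, 5)

# 25 elements cannot fill a 4 x 4 grid, so NumPy raises a ValueError.
reshape_error = None
try:
    arr.reshape(4, 4)
except ValueError as err:
    reshape_error = str(err)

print(same_numbers, grid.shape, reshape_error)
```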

**Indexing and Slicing**

In [None]:
arr = np.arange(0,11)
arr
arr[8]
arr[1:5] # up to and not including index 5
arr[:5] # start at the beginning and go up to (not including) index 5
arr[5:] # from index 5 all the way to the end

# broadcasting ability
# a single value is broadcast to every element at indices 0 to 4
arr
arr[:5] = 100
arr
arr = np.arange(0,11)
slice_of_array = arr[0:5]
slice_of_array
arr
slice_of_array[:] = 99 # from the beginning all the way to the end
slice_of_array
arr # the broadcast assignment has affected the original array, because a slice is a view
# if you don't want it to affect the original array,
# you need to explicitly make a copy
arr_copy = arr.copy()
arr_copy[:] = 100
arr_copy
arr

# indexing on 2D arrays
arr_2d = np.array([[58,45,75],[20,25,30],[45,67,23]])
arr_2d
arr_2d.shape
arr_2d[0]
arr_2d[2]
arr_2d[1][1]
arr_2d[1,1]
arr_2d[0,2]
# with real data sets, rows are data points and columns are features
arr_2d[:2] # first two rows
arr_2d[:2,1:] # first two rows, columns from index 1 onward
# it's unlikely we'll need this sort of double subset;
# typically with a real data set we subset either on rows or on columns

# Conditional Selection
arr = np.arange(1,11)
arr
arr > 4
# we can perform a comparison on the array
# and then pass the result as a conditional selection filter
bool_arr = arr > 4
bool_arr
arr[bool_arr] # filtering the original array
# we will do this sort of thing quite often, especially with pandas
arr[arr > 4]
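
The difference between a slice and a copy shown in the cell above can be confirmed with `np.shares_memory`. This is a minimal sketch with illustrative variable names; it also shows that boolean selection returns new data rather than a view.

```python
import numpy as np

arr = np.arange(0, 11)

# A slice is a view: it points at the same memory as the original array.
view = arr[0:5]
view[:] = 99          # this also changes arr[0:5]

# An explicit copy owns its own memory.
arr_copy = arr.copy()
arr_copy[:] = 100     # arr is left untouched

# Boolean selection builds a new array as well.
selected = arr[arr > 4]

print(arr)
print(np.shares_memory(arr, view))      # view shares memory with arr
print(np.shares_memory(arr, arr_copy))  # copy does not
```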

**Operation**

In [None]:
# operations are performed on an element-by-element basis
arr = np.arange(0,10)
arr + 5
arr - 2
arr + arr     # the shapes must match (or be broadcastable) for this to work
arr * arr
arr - arr # any sort of arithmetic operation works

# some universal array functions
np.sqrt(arr)
np.sin(arr)
arr.sum()
arr.mean()
arr.var()
arr.std()
# the array has one dimension, so every element is taken into account

# So let's try a 2D array
np.arange(0,25)
arr2d = np.arange(0,25).reshape(5,5)
arr2d.sum() # sum of every element

# sums along each axis
arr2d.sum(axis = 0) # collapses the rows: one sum per column
arr2d.sum(axis = 1) # collapses the columns: one sum per row
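
The `axis` argument is easy to mix up, so here is a small sketch on the same 5 x 5 array. It also shows broadcasting between arrays of different shapes: the five column means are stretched across every row. The variable names are illustrative.

```python
import numpy as np

arr2d = np.arange(0, 25).reshape(5, 5)

# axis=0 collapses the rows, leaving one sum per column.
col_sums = arr2d.sum(axis=0)   # [50 55 60 65 70]

# axis=1 collapses the columns, leaving one sum per row.
row_sums = arr2d.sum(axis=1)   # [10 35 60 85 110]

# Broadcasting: a (5,) array of column means is subtracted from each row
# of the (5, 5) array, so every column ends up centered on zero.
centered = arr2d - arr2d.mean(axis=0)

print(col_sums)
print(row_sums)
print(centered.mean(axis=0))
```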

#**Pandas**
The name pandas is derived from "panel data".

It's one of the most popular Python libraries for data handling.

It is built directly on top of NumPy, so we will see a lot of similarities.

With pandas we can read in our data, clean it, and even perform feature engineering.

It is useful across many stages of a data analysis procedure.

Pandas relies on some core data structures for its operations, which we will learn today.

The first is the pandas Series, which is essentially a NumPy array with a named index.

Then we will expand this to the real core structure of pandas, the DataFrame.

pydata.org/pandas-docs/stable

In [None]:
##### Pandas Series
# similar to a NumPy array, only with a named index
import pandas as pd
labels = ['a','b','c']
mylist = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
pd.Series(data = mylist)  # by default we get an integer index 0, 1, 2
# with named labels we can still look values up by their integer location
pd.Series(data = mylist, index = labels)
# the keyword names are optional as long as the arguments are in the correct order
pd.Series(mylist, labels)
pd.Series(arr)
pd.Series(arr, labels)
pd.Series(d)

# Let's build some more complex Series
salesQ1 = pd.Series(data = [250,456,234,456], index = ['USA', 'J', 'C', 'I'])
salesQ2 = pd.Series(data = [250,500,334,456], index = ['Bhutan','J', 'B','I'])
salesQ2['B'] # look up by index label
# or by integer-based location
salesQ2.iloc[2]

# we can perform operations between Series
salesQ1 + salesQ2 # by default, labels that are not in both Series give NaN
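
The NaN behavior when adding Series comes from label alignment. Below is a minimal sketch with small made-up sales figures (the country labels and numbers are hypothetical, not taken from the cell above). It also shows `Series.add` with `fill_value=0`, one possible way to treat a missing label as zero instead of getting NaN.

```python
import pandas as pd

# Hypothetical data: only 'USA' appears in both quarters.
q1 = pd.Series([250, 450], index=["USA", "China"])
q2 = pd.Series([260, 500], index=["USA", "Brazil"])

# Plain addition aligns on labels; unmatched labels become NaN.
total = q1 + q2

# add() with fill_value=0 substitutes 0 for the missing side.
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```

Note that the result is a float Series in both cases, because NaN cannot be stored in an integer column.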

In [None]:
#### Pandas DataFrame
# a DataFrame is multiple pandas Series sharing the same index
columns = ['A','B','C','D']
index = ["V","W","X","Y","Z"]
from numpy.random import randint
np.random.seed(42)
data = randint(-100,100,(5,4))
data

# creating pandas DataFrame
df = pd.DataFrame(data, index, columns)
df

# How do we grab a single column?
df["A"]
type(df["A"])      # it's a Series
df[["A","B"]] # double brackets because we're actually passing in a list of columns
df[["B","A"]] # the columns do not have to be in order

#### Feature engineering
# where we create a new column from existing columns
# we simply reference the new column as if it already exists
# and then give the formula for what we want to construct
df['new'] = df["A"] + df["B"]
df

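# removing a column
df.drop('new', axis = 1)   # remember that this is not an in-place drop
df # 'new' is still there
# permanent dropping: either reassign...
df = df.drop('new', axis = 1)
# ...or drop in place (use one or the other; dropping twice raises a KeyError)
# df.drop('new', axis = 1, inplace = True)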

### Same operations for rows
df.loc["V"]  # .loc tells pandas that we are looking up a row by label
df.loc[["V", "W"]]

df.iloc[0]
df.iloc[-1] # grabbing rows by integer-based location
df.iloc[0:3]

# dropping rows by name
df.drop('X') # by default axis = 0; this is not in place either

# selecting a subset of rows and columns at the same time
df.loc["V", "C"] # similar logic we used with numpy
df.loc[["V","W"], "C"]
df.loc[["V","W"], ["A", "C"]]
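
Two points from the cell above are worth seeing in isolation: `drop` returns a new DataFrame unless the result is reassigned, and `.loc` (labels) and `.iloc` (integer positions) can address the same cell. The small DataFrame here is a made-up example, not the random one built above.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 2 frame: rows V, W, X and columns A, B.
df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=["V", "W", "X"], columns=["A", "B"])
df["new"] = df["A"] + df["B"]

# drop returns a new DataFrame; df itself keeps the column.
dropped = df.drop("new", axis=1)

# The same cell addressed by label and by integer position.
by_label = df.loc["W", "B"]
by_position = df.iloc[1, 1]

print(list(df.columns), list(dropped.columns))
print(by_label, by_position)
```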

In [None]:
df
# grabbing the subset of the DataFrame that only has positive values
df > 0 # the comparison operator is applied across the whole DataFrame
df[df > 0] # passing it to the DataFrame as a filter; non-matching cells become NaN
df["A"] > 0
df[df["A"] > 0]  # pass this condition to our DataFrame; only matching rows are kept
df[df["A"] > 0]["C"]

# let's combine two conditions
df[(df["A"] > 0) & (df["B"] > 1)] # & means "and"
df[(df["A"] > 0) | (df["B"] > 1)] # | means "or"

# resetting the index
df.reset_index()

new_ind = ["Ca", "Ny", "Wy", "Or", "Co"]
df['states'] = new_ind
df
df.set_index('states') # not in place; reassign to keep the new index
df.columns
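
When combining conditions, pandas needs the element-wise operators `&` and `|` with each condition in parentheses; Python's plain `and` raises an error on a Series. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical data chosen so each row hits a different case.
df = pd.DataFrame({"A": [5, -3, 8, -1], "B": [2, 7, -4, -6]},
                  index=["W", "X", "Y", "Z"])

both = df[(df["A"] > 0) & (df["B"] > 1)]     # rows where both hold
either = df[(df["A"] > 0) | (df["B"] > 1)]   # rows where at least one holds

# Python's `and` tries to reduce each Series to a single True/False,
# which is ambiguous, so pandas raises a ValueError.
and_error = None
try:
    df[(df["A"] > 0) and (df["B"] > 1)]
except ValueError as err:
    and_error = str(err)

print(list(both.index), list(either.index))
```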

In [None]:
# Missing data: whether to drop or fill is an educated guess based on common sense
df = pd.DataFrame({'A':[4,3,np.nan,2],'B':[90,np.nan,np.nan,2],'C': [12,23,34,45]})
df
# remove
df.dropna() # drop rows with any missing value
df.dropna(axis = 1) # drop columns with any missing value
df.dropna(axis = 1, thresh = 3) # keep columns with at least 3 non-missing values
# fill in missing values
df.fillna(value = 'FILL VALUE')
df.fillna(value = 0)
# filling based on columns
df['A'].fillna(value = 0)
df['A'] = df['A'].fillna(value = 0)
# with the mean
df["B"].mean()
df['B'].fillna(value = df["B"].mean())
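
As a possible shortcut for the column-by-column filling above, `fillna` also accepts a Series of per-column values, so passing `df.mean()` fills every column with its own mean in one step. The sketch below rebuilds the same DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 3, np.nan, 2],
                   'B': [90, np.nan, np.nan, 2],
                   'C': [12, 23, 34, 45]})

# thresh=3 keeps only columns with at least 3 non-missing values (A and C).
kept = df.dropna(axis=1, thresh=3)

# df.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NaN values are skipped when averaging).
filled = df.fillna(df.mean())

print(list(kept.columns))
print(filled)
```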

In [None]:
# GroupBy ---> must be followed by an aggregation method (sum, mean, ...)
# (assumes df is a data set with 'Year' and 'Sector' columns, not the DataFrame above)
df.groupby('Year').sum()
df.groupby('Year').sum().sort_index(ascending = False)

# group by year and then sector
df.groupby(['Year', 'Sector']).sum() # multi-tiered (hierarchical) index
df.groupby('Year').describe() # multi-tiered columns
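
The data set used in the groupby cell is not shown, so here is a self-contained sketch on a small made-up table. The `Year` and `Sector` column names follow the cell above; the `Sales` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020, 2020],
    "Sector": ["Tech", "Energy", "Tech", "Tech", "Energy"],
    "Sales": [100, 50, 120, 30, 60],
})

# One group per year, aggregated with sum.
by_year = df.groupby("Year")["Sales"].sum()

# Newest year first.
by_year_desc = by_year.sort_index(ascending=False)

# Grouping by two keys gives a multi-level (hierarchical) index.
by_year_sector = df.groupby(["Year", "Sector"])["Sales"].sum()

print(by_year_desc)
print(by_year_sector)
```

Selecting the `Sales` column before aggregating keeps the text column `Sector` out of the sum.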