###Packages and Library Functions

Library (or package; the two words are often used interchangeably)
- Reusable collection of code, related modules, and documentation

Packages in Python work similarly to packages in R in that
- they extend Python with functionality that is not included in the base language
- to access them we need to import them
- they contain modules (a module is simply a Python file containing a collection of functions and global variables)
-- global variables: variables that can be accessed from anywhere in a program or script
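
As a small illustration of the points above, here is a sketch of the common import styles. It uses Python's built-in `math` module (chosen here only because it needs no installation), and the variable names are made up for the example.

```python
# Three common ways to import from a module (using the built-in math module)
import math                 # import the whole module
import math as m            # import the module under a shorter alias
from math import sqrt, pi   # import specific names only

root_full = math.sqrt(16)   # access through the module name
root_alias = m.sqrt(16)     # access through the alias
root_direct = sqrt(16)      # access the imported function directly

print(root_full, root_alias, root_direct)
print(pi)  # pi is a module-level (global) variable defined in math
```

All three calls reach the same function; the alias form is the one used for `numpy as np` and `pandas as pd` in these notes.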


**matplotlib**- a library for creating static, animated, and interactive visualizations in Python

**Seaborn**- a data visualization library based on matplotlib that provides a simpler interface for working with common plots and graphs

**NumPy** (Numerical Python)
- essential library that provides multidimensional array and matrix data structures and functions to manipulate them
- used for scientific computation

**pandas** (Python Data Analysis)
- powerful library built on top of NumPy that's used to manipulate and analyze tabular data
- note: powered by NumPy under the hood

Others to keep in mind/learn about
- scikit-learn
- statsmodels


##NumPy Intro

**Vectorization**
- enables operations to be performed on multiple components of a data object at the same time

In [2]:
#using our list data from last walkthrough
list_a = [1, 2, 3]
list_b = [2, 4, 6]

#vectorized computation

#note: you may need to run 'pip install numpy' in your command prompt if NumPy is not already installed
import numpy as np #import with the conventional alias np

# Convert lists to arrays
array_a = np.array(list_a)
array_b = np.array(list_b)

# Perform element-wise multiplication between the arrays
array_a * array_b

array([ 2,  8, 18])
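
For comparison, here is a sketch of the same calculation without vectorization. Plain lists can't be multiplied together directly (`list_a * list_b` raises a `TypeError`), so the non-NumPy version needs an explicit loop. The names `products_loop` and `products_vectorized` are made up for this example.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [2, 4, 6]

# Without NumPy: multiply pair by pair using a list comprehension
products_loop = [a * b for a, b in zip(list_a, list_b)]

# With NumPy: one vectorized expression handles every element
products_vectorized = np.array(list_a) * np.array(list_b)

print(products_loop)
print(products_vectorized)
```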

**Arrays**

In [3]:
# The np.array() function converts an object to an ndarray
x = np.array([1, 2, 3, 4])
x

array([1, 2, 3, 4])

In [4]:
# Arrays can be indexed, and an element can be reassigned by its index (-1 is the last element)
x[-1] = 5
x

array([1, 2, 3, 5])

In [None]:
# Trying to assign to an index that doesn't exist will throw an IndexError
x[4] = 10
#note: arrays have a fixed size, so to add an element you must create a new array and reassign it
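
One way to get a longer array is `np.append()`, which builds a new array rather than changing the existing one. The sketch below assumes the same values as `x` above; `x_longer` is a name made up for the example.

```python
import numpy as np

x = np.array([1, 2, 3, 5])

# Assigning to a position past the end raises IndexError
try:
    x[4] = 10
except IndexError as err:
    print("IndexError:", err)

# np.append() returns a new, longer array; it does not modify x in place
x_longer = np.append(x, 10)

print(x)         # the original array is unchanged
print(x_longer)  # the new array has the extra element
```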

In [10]:
# Arrays must contain the same data type
arr = np.array([1, 2, 'coconut'])
arr

#notice how NumPy converts the integers to strings so that every element shares one data type
    #NumPy picks a common type that can represent every element

array(['1', '2', 'coconut'], dtype='<U11')
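
The same type conversion happens with other mixes of values. A short sketch (the variable names are made up for illustration):

```python
import numpy as np

ints_only = np.array([1, 2, 3])              # all integers
ints_and_float = np.array([1, 2, 3.5])       # integers are upcast to floats
ints_and_text = np.array([1, 2, 'coconut'])  # everything becomes a string

print(ints_only.dtype)
print(ints_and_float.dtype)
print(ints_and_text.dtype)
```

The exact integer and string dtype names printed can vary by platform and NumPy version, but the pattern is the same: one dtype for the whole array.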

In [9]:
# NumPy arrays are a class called `ndarray`
print(type(arr))

<class 'numpy.ndarray'>


In [None]:
# The dtype attribute returns the data type of an array's contents
arr = np.array([1, 2, 3])
arr.dtype

In [11]:
# Create a 2D array by passing a list of lists to np.array() function
arr_2d = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print(arr_2d.shape) #number of elements in each dimension of an array
print(arr_2d.ndim) #the number of dimensions in an array
arr_2d

(4, 2)
2


array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

In [12]:
# The reshape() method changes the shape of an array
arr_2d = arr_2d.reshape(2, 4)
arr_2d

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
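
The new shape has to hold exactly the same number of elements as the old one. As a sketch, passing `-1` for one dimension lets NumPy work that dimension out, and an incompatible shape raises an error (`inferred` is a made-up name):

```python
import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])  # shape (4, 2), 8 elements

# -1 asks NumPy to infer that dimension from the total number of elements
inferred = arr.reshape(2, -1)
print(inferred.shape)

# A shape whose dimensions don't multiply to 8 raises ValueError
try:
    arr.reshape(3, 3)
except ValueError as err:
    print("ValueError:", err)
```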

In [16]:
#Other functions:
# The np.log() function returns the natural logarithm of each element in an array
np.log(arr_2d)

array([[0.        , 0.69314718, 1.09861229, 1.38629436],
       [1.60943791, 1.79175947, 1.94591015, 2.07944154]])

In [17]:
# The np.floor() function returns the value of a number rounded down
# to the nearest integer (as a float)
np.floor(5.7)

5.0

In [18]:
# The np.ceil() function returns the value of a number rounded up
# to the nearest integer (as a float)
np.ceil(5.3)

6.0
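
Like `np.log()`, these functions are vectorized, so they also work element-wise on a whole array. A small sketch with made-up values:

```python
import numpy as np

values = np.array([5.7, 5.3, -1.5])

floored = np.floor(values)  # round each element down
ceiled = np.ceil(values)    # round each element up

print(floored)
print(ceiled)
```

Note that "down" means toward negative infinity, so `-1.5` floors to `-2.0` and ceils to `-1.0`.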

##Pandas Example with Titanic Data

In [1]:
# NumPy and pandas are typically imported together.
# np and pd are conventional aliases.
import numpy as np
import pandas as pd

In [13]:
# Read in data from a .csv file
titanic = pd.read_csv('https://raw.githubusercontent.com/adacert/titanic/main/train.csv')
#note: similar to read.csv() in R

# Print the first 15 rows
titanic.head(15)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
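
`pd.read_csv()` accepts file paths, URLs, and file-like objects. To try it without a network connection, here is a sketch that reads a tiny hypothetical sample (not the real Titanic file) from an in-memory string:

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22.0
2,1,1,38.0
3,1,3,
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head(2))   # first two rows only
print(sample.shape)     # (rows, columns)
```

The empty `Age` field in the last row is read as a missing value (`NaN`), which is also how the blank ages in the Titanic data appear.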


In [31]:
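#dataframe.info() displays a summary of the dataset, including the range index, column names, non-null counts, dtypes, and memory usage
#note: the 2023_Fare column appears in this output because this cell was run after that column was created further down

titanic.info() #works similarly to str() in R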

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
 12  2023_Fare    891 non-null    float64
dtypes: float64(3), int64(5), object(5)
memory usage: 90.6+ KB


In [None]:
#dataframe.describe() displays descriptive statistics for the numeric columns: count, mean, standard deviation, min, max, and quartiles
titanic.describe()

Calculating Manually

In [14]:
# Calculate the mean of the Age column
titanic['Age'].mean()

29.69911764705882

In [15]:
# Calculate the maximum value contained in the Age column
titanic['Age'].max()

80.0

In [16]:
# Calculate the minimum value contained in the Age column
titanic['Age'].min()

0.42

In [17]:
# Calculate the standard deviation of the values in the Age column
titanic['Age'].std()

14.526497332334042

In [18]:
# Return the number of rows that share the same value in the Pclass column
titanic['Pclass'].value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64
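
To see proportions instead of raw counts, `value_counts()` takes a `normalize=True` argument. A sketch using a short hypothetical Series in place of the real `Pclass` column:

```python
import pandas as pd

# Hypothetical class labels standing in for the Pclass column
pclass = pd.Series([3, 1, 3, 2, 3, 1], name='Pclass')

counts = pclass.value_counts()                 # rows per class
shares = pclass.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```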

In [9]:
# The describe() method returns summary statistics of the dataframe
titanic.describe()

#works similarly to the summary() function in R

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Task: In the titanic table, find all the rows for passengers who are over 60 and in class 3

In [19]:
# Filter the data to return only rows where value in Age column is greater than 60
# and value in Pclass column equals 3
titanic[(titanic['Age'] > 60) & (titanic['Pclass'] == 3)]

#note how this filter differs from R and SQL: each condition must be wrapped in parentheses, and conditions are joined with & (not Python's 'and')

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
280,281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q
326,327,0,3,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
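
The other logical operators follow the same pattern: `|` for OR and `~` for NOT. The sketch below uses a hypothetical four-row table rather than the Titanic data, and also shows `query()`, which writes the same AND filter as a string.

```python
import pandas as pd

# Hypothetical mini-table with the two columns used in the filter
people = pd.DataFrame({'Age': [70.5, 25.0, 63.0, 61.0],
                       'Pclass': [3, 3, 1, 3]})

both = people[(people['Age'] > 60) & (people['Pclass'] == 3)]    # AND
either = people[(people['Age'] > 60) | (people['Pclass'] == 3)]  # OR
not_third = people[~(people['Pclass'] == 3)]                     # NOT

# query() expresses the AND filter as a string
both_query = people.query('Age > 60 and Pclass == 3')

print(both)
print(not_third)
```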


In [20]:
# Use iloc to access data by integer position (counting from 0).
# Select row 1, column 3.
titanic.iloc[1, 3]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
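
`iloc` selects by position, while its counterpart `loc` selects by label. A sketch on a small made-up dataframe with letter labels, so the difference is visible:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'],
                   'age': [30, 40, 50]},
                  index=['a', 'b', 'c'])

by_position = df.iloc[1, 0]      # second row, first column
by_label = df.loc['b', 'name']   # row labelled 'b', column 'name'
first_two = df.iloc[0:2]         # position slices exclude the end

print(by_position, by_label)
print(first_two)
```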

Task: Find what the Titanic fares would cost if the tickets were sold today (inflation: multiply each fare by 146.14)

In [21]:
# Create a new column called 2023_Fare that contains the inflation-adjusted
# fare of each ticket in 2023 pounds
titanic['2023_Fare'] = titanic['Fare'] * 146.14
titanic #note: unlike R, you don't need to call View() to display the table

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,2023_Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,1059.515000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,10417.341462
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1158.159500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,7760.034000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1176.427000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1899.820000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,4384.200000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,3426.983000
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,4384.200000
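
An alternative to assigning a column in place is `assign()`, which returns a new dataframe and leaves the original alone. This sketch uses three fares from the table above; the column is named `Fare_2023` here (an assumption for the example) because `assign()` takes column names as keyword arguments, and a name starting with a digit is not a valid keyword.

```python
import pandas as pd

fares = pd.DataFrame({'Fare': [7.25, 71.2833, 7.925]})
INFLATION_MULTIPLIER = 146.14  # multiplier taken from the task above

# assign() returns a new dataframe and leaves `fares` untouched
adjusted = fares.assign(
    Fare_2023=(fares['Fare'] * INFLATION_MULTIPLIER).round(2)
)

print(adjusted)
print(fares.columns.tolist())  # still only the original column
```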


##Boolean Masking
- filtering technique that overlays a Boolean grid onto a dataframe in order to select only the values in the dataframe that align with the True values of the grid
- Note: this does not change the original dataframe; it only returns the matching rows. Assign the result to a variable to keep it.

In [22]:
# Making dictionary of planetary data
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
       'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
       'moons': [0, 0, 1, 2, 80, 83, 27, 14]
        }
# Use pd.DataFrame() function to convert dictionary to dataframe
planets = pd.DataFrame(data)
planets

,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
4,Jupiter,69911,80
5,Saturn,58232,83
6,Uranus,25362,27
7,Neptune,24622,14


In [23]:
# Create a Boolean mask of planets with fewer than 20 moons
mask = planets['moons'] < 20
mask

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: moons, dtype: bool

In [24]:
# Apply the Boolean mask to the dataframe to filter it so it contains
# only the planets with fewer than 20 moons
planets[mask]

,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
7,Neptune,24622,14


In [25]:
# You can also define the Boolean mask and apply it in a single line
planets[planets['moons'] < 20]

,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
7,Neptune,24622,14
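
Masks can also be stored in variables and combined. The sketch below reuses the planet data from above; the mask names and the 6000 km cutoff are made up for the example. It also confirms that the original dataframe keeps all of its rows.

```python
import pandas as pd

planets = pd.DataFrame({'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
                        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                                      25362, 24622],
                        'moons': [0, 0, 1, 2, 80, 83, 27, 14]})

few_moons = planets['moons'] < 20      # first Boolean mask
large = planets['radius_km'] > 6000    # second Boolean mask

# Keep only the rows where both masks are True
small_systems = planets[few_moons & large]

print(small_systems)
print(len(planets))  # the original dataframe still has every row
```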


##Grouping

Grouping in pandas works similarly to GROUP BY in SQL

In [26]:
#Using our planet data with additional columns
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes','yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']
        }

# Use pd.DataFrame() function to convert dictionary to dataframe
planets = pd.DataFrame(data)
planets

,planet,radius_km,moons,type,rings,mean_temp_c,magnetic_field
0,Mercury,2440,0,terrestrial,no,167,yes
1,Venus,6052,0,terrestrial,no,464,no
2,Earth,6371,1,terrestrial,no,15,yes
3,Mars,3390,2,terrestrial,no,-65,no
4,Jupiter,69911,80,gas giant,yes,-110,yes
5,Saturn,58232,83,gas giant,yes,-140,yes
6,Uranus,25362,27,ice giant,yes,-195,yes
7,Neptune,24622,14,ice giant,yes,-200,yes


In [27]:
# The groupby() method returns a groupby object
planets.groupby(['type'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017B18080750>

In [None]:
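# Apply the sum() function to the groupby object to get the sum of the values in each numerical column for each group
# (numeric_only=True skips the text columns, which pandas 2.0 and later would otherwise concatenate)
planets.groupby(['type']).sum(numeric_only=True)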

In [28]:
# Select only the 'moons' column from the groupby object,
# then apply the sum function to it
planets.groupby(['type'])[['moons']].sum()

type,moons
gas giant,163
ice giant,41
terrestrial,3
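
To compute several statistics per group at once, a groupby object also has an `agg()` method. A sketch using just the `type` and `moons` columns from the planet data (`summary` is a made-up name):

```python
import pandas as pd

planets = pd.DataFrame({
    'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
             'gas giant', 'gas giant', 'ice giant', 'ice giant'],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
})

# One row per planet type, one column per aggregate function
summary = planets.groupby('type')['moons'].agg(['sum', 'mean', 'max'])

print(summary)
```

This is roughly the pandas counterpart of `SELECT type, SUM(moons), AVG(moons), MAX(moons) ... GROUP BY type` in SQL.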
