# PYTHON

### Install
`brew install --cask anaconda`
#### Set the PATH variable to include the location of the Anaconda distribution
For Intel CPU: `echo 'export PATH=/usr/local/anaconda3/bin:$PATH' >> ~/.zshrc`
#### Reload the configuration file to allow the changes to take effect
`source ~/.zshrc`

`python --version` to verify installation
### Using Python
- In Terminal: `python`/`ipython`, `quit()`/`help()`
- To use `code`: press `cmd+shift+P` in VSCode and click `Install code...path`

### Python notes

[String Methods](https://www.w3schools.com/python/python_ref_string.asp)

[List Methods](https://www.w3schools.com/python/python_ref_list.asp)

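- `*args` collects extra positional arguments into a tuple; at a call site, `*` unpacks a list or tuple into positional arguments
- `**kwargs` collects extra keyword arguments into a dictionary where keys are parameter names and values are arguments; at a call site, `**` unpacks a dictionary into keyword arguments

    - e.g. `def rand_func(argu_1=1, *args, **kwargs)`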
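
A minimal sketch of how packing and unpacking behave. The function name `describe_call` and its arguments are hypothetical and chosen only for illustration:

```python
def describe_call(first, *args, **kwargs):
    """Return the arguments exactly as the function received them."""
    # args is a tuple holding the extra positional arguments
    # kwargs is a dict holding the extra keyword arguments
    return first, args, kwargs

# Extra positionals are packed into args, extra keywords into kwargs
packed = describe_call(1, 2, 3, color='red', size=10)
print(packed)  # (1, (2, 3), {'color': 'red', 'size': 10})

# At the call site, * and ** do the reverse: they unpack a list and a dict
numbers = [1, 2, 3]
options = {'color': 'red', 'size': 10}
unpacked = describe_call(*numbers, **options)
print(unpacked == packed)  # True
```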

#### String Formatting

In [12]:
name = 'World'
str1 = 'Hello, %s!' % name        # %-formatting
print(str1)
str2 = 'Hello, {}!'.format(name)  # str.format()
print(str2)
str3 = f'Hello, {name}!'          # f-string
print(str3)

Hello, World!
Hello, World!
Hello, World!


# DS Libraries

### [numpy](https://numpy.org/doc/stable/)

In [13]:
import numpy as np

In [14]:
# array
a = np.array([1, 2, 3])

# matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# index r x c
print(matrix[1:, :2])

matrix.shape

[[ 4  5]
 [ 7  8]
 [10 11]]


(4, 3)

In [15]:
# vectorized operations
# i.e. we can operate on an array with an operator, unlike on a list
# including comparison operators

In [16]:
print(matrix + 2)
print(matrix > 8)

[[ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
[[False False False]
 [False False False]
 [False False  True]
 [ True  True  True]]


In [17]:
# Can index/filter array/matrix with conditional, array returned
matrix[matrix % 2 == 0]

array([ 2,  4,  6,  8, 10, 12])

[np.random](https://numpy.org/doc/stable/reference/random/index.html#module-numpy.random).randn()

In [18]:
# array sample size 10
np.random.randn(10)

array([ 0.26400169,  0.67710848, -0.00749384, -0.55002934,  1.03815064,
       -0.16349687,  0.91270525,  0.82213286,  1.16396882, -1.48853388])

In [19]:
# matrix sample
np.random.randn(3, 4)

array([[-0.98022898, -0.07788525,  0.09117397,  0.31879588],
       [ 0.6799653 , -0.5230222 ,  1.13391162, -0.22241111],
       [ 2.35470193, -0.32744803,  0.63880289,  0.30988093]])

In [20]:
mu = 100
sigma = 20

sigma * np.random.randn(20) + mu

array([ 85.36516404, 105.6936453 ,  89.61954853, 105.35734702,
        93.28915034,  92.96363448,  68.87326956,  97.64253695,
        80.82893479,  76.12990633,  80.01297509,  82.24792643,
       108.37390146,  92.5863223 ,  83.67464715,  65.90844458,
       121.16269402, 140.21229476, 108.47694508,  96.61625513])

In [21]:
print(np.zeros(3))
print(np.ones(3))
print(np.full(3, 17))
print(np.arange(4)) # 0 up to stop (exclusive)
print(np.arange(1, 4)) # start, stop (exclusive)
print(np.arange(1, 9, 2)) # start, stop (exclusive), step
print(np.linspace(1, 4, 7)) # start, stop (inclusive), number of elements

[0. 0. 0.]
[1. 1. 1.]
[17 17 17]
[0 1 2 3]
[1 2 3]
[1 3 5 7]
[1.  1.5 2.  2.5 3.  3.5 4. ]


[Array methods](https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-methods): `.min()`, `.max()`, `.mean()`, `.std()`, `.sum()`, etc.
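
As a quick sketch, the array methods listed above can be tried on the same `matrix` used earlier. The `axis` argument is not covered in these notes; it is shown here on the assumption that per-column and per-row summaries are useful:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# Without an axis, each method reduces over every element
print(matrix.min(), matrix.max())  # 1 12
print(matrix.sum())                # 78
print(matrix.mean())               # 6.5
print(matrix.std())                # population standard deviation (ddof=0 by default)

# axis=0 reduces down the rows (one result per column),
# axis=1 reduces across the columns (one result per row)
col_sums = matrix.sum(axis=0)
row_means = matrix.mean(axis=1)
print(col_sums)   # [22 26 30]
print(row_means)  # [ 2.  5.  8. 11.]
```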

### [pandas](https://pandas.pydata.org/docs/)

##### Series
**Part 1**

From a list, array, or dictionary:

`myseries = pd.Series(<list or array or dictionary>)`

From existing dataframe:

`myseries = df['col_for_series']`

`myseries = df.col_for_series`

In [23]:
import pandas as pd
from pydataset import data

my_list = [2, 3, 5]
print(pd.Series(my_list), '\n')

my_array = np.array([8.0, 13.0, 21.0])
print(pd.Series(my_array), '\n')

labeled_series = pd.Series({'a' : 0, 'b' : 1.5, 'c' : 2, 'd': 3.5, 'e': 4, 'f': 5.5})
print(labeled_series, '\n')

sleep_df = data('sleepstudy')
sleep_df.head()

0    2
1    3
2    5
dtype: int64 

0     8.0
1    13.0
2    21.0
dtype: float64 

a    0.0
b    1.5
c    2.0
d    3.5
e    4.0
f    5.5
dtype: float64 



| | Reaction | Days | Subject |
|---|---|---|---|
| 1 | 249.56 | 0 | 308 |
| 2 | 258.7047 | 1 | 308 |
| 3 | 250.8006 | 2 | 308 |
| 4 | 321.4398 | 3 | 308 |
| 5 | 356.8519 | 4 | 308 |


In [None]:
print(sleep_df.Reaction, '\n')
sleep_df['Reaction']

In [None]:
sleep_df[['Reaction']] # df

Data types you will see in series and dataframes:

- int: integer, whole number values
- float: decimal numbers
- bool: true or false values
- object: strings
- category: a fixed set of string values

A series also has a name, an optional human-friendly label. pandas infers the data type when a series is created; use `<series>.astype()` to cast it to another type.
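
Type casting with `.astype()` might look like the sketch below. The series values and the name `'scores'` are made up for illustration:

```python
import pandas as pd

# pandas infers float64 from these values
float_series = pd.Series([1.0, 2.0, 3.0])
print(float_series.dtype)  # float64

# Cast to integers, strings, or a category
as_int = float_series.astype(int)
as_str = float_series.astype(str)
as_cat = pd.Series(['low', 'high', 'low']).astype('category')

print(as_int.tolist())                 # [1, 2, 3]
print(as_str.tolist())                 # ['1.0', '2.0', '3.0']
print(as_cat.cat.categories.tolist())  # ['high', 'low']

# The optional name can be set when creating or renaming a series
named = float_series.rename('scores')
print(named.name)  # scores
```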

*[Series Attributes](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)*
- `.index`, `.values`, `.dtype`, `.name`, `.size`, `.shape`

*[Series Methods](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)*
- `.head/tail()`, `.sample()`, `.value_counts()`, descriptive stats, including `.describe()`, `.nsmallest/nlargest()`
- `.sort_values/index()`
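
A short sketch of the attributes and methods listed above, using a small made-up series named `digits`:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5], name='digits')

# Attributes are accessed without parentheses
print(s.index)                  # RangeIndex(start=0, stop=5, step=1)
print(s.values)                 # [3 1 4 1 5]
print(s.name, s.size, s.shape)  # digits 5 (5,)

# Methods return new objects and leave s unchanged
counts = s.value_counts()       # how often each value occurs, most frequent first
top_two = s.nlargest(2)         # the two largest values
ordered = s.sort_values()       # values in ascending order
print(counts)
print(top_two.tolist())         # [5, 4]
print(ordered.tolist())         # [1, 1, 3, 4, 5]
```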

**Part 2**

In [24]:
my_series = pd.Series([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
print(my_series >= 5, '\n') # vector conditional operation
print(~(my_series >= 5), '\n') # inverts the boolean mask
print(my_series[my_series >= 5]) # index using boolean sequence
# can create compound conditions: wrap each condition in parentheses and combine with | (or) and & (and)

0     False
1     False
2     False
3     False
4      True
5      True
6     False
7      True
8      True
9     False
10     True
dtype: bool 

0      True
1      True
2      True
3      True
4     False
5     False
6      True
7     False
8     False
9      True
10    False
dtype: bool 

4     5
5     9
7     6
8     5
10    5
dtype: int64
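
Compound conditions on the same `my_series` can be sketched as follows. The variable names are illustrative:

```python
import pandas as pd

my_series = pd.Series([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])

# Each condition needs its own parentheses because & and |
# bind more tightly than the comparison operators
between_3_and_5 = my_series[(my_series >= 3) & (my_series <= 5)]
low_or_high = my_series[(my_series < 2) | (my_series > 8)]

print(between_3_and_5.tolist())  # [3, 4, 5, 5, 3, 5]
print(low_or_high.tolist())      # [1, 1, 9]
```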


`<series>.describe()` returns a Series, so it can be indexed

In [25]:
sleep_df = data('sleepstudy')
sleep_reaction_time_series = sleep_df.Reaction
print(sleep_reaction_time_series.describe()) # series
sleep_reaction_time_series.describe()['75%']

count    180.000000
mean     298.507892
std       56.328757
min      194.332200
25%      255.375825
50%      288.650800
75%      336.752075
max      466.353500
Name: Reaction, dtype: float64


336.752075

String methods can be applied to a Series through the `.str` accessor

In [26]:
ds_team_series = pd.Series(['Adam', 'Adam', 'Andrew', 'Carina',
                            'Madeleine', 'Misty', 'Margaret'])
ds_team_series.str.lower() 


0         adam
1         adam
2       andrew
3       carina
4    madeleine
5        misty
6     margaret
dtype: object

In [27]:
# can chain multiple string methods
ds_team_series[ds_team_series.str.lower().str.startswith('a')]

0      Adam
1      Adam
2    Andrew
dtype: object

More Series Methods:
- `.any()`: returns single boolean. Do any values in the series meet the condition?
- `.all()`: returns single boolean. Do all values in the series meet the condition?
- `.isin()`: returns a series of boolean values. Is each value in the series in the given list?
- `.apply()`: apply a function to each item in a series.
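
The four methods above can be sketched on the `ds_team_series` from the earlier cell. The list passed to `.isin()` is an arbitrary choice for illustration:

```python
import pandas as pd

ds_team_series = pd.Series(['Adam', 'Adam', 'Andrew', 'Carina',
                            'Madeleine', 'Misty', 'Margaret'])

# .any() / .all() collapse a boolean series to a single value
starts_with_a = ds_team_series.str.startswith('A')
print(starts_with_a.any())  # True
print(starts_with_a.all())  # False

# .isin() returns a boolean mask that can be used for indexing
in_list = ds_team_series.isin(['Adam', 'Misty'])
print(ds_team_series[in_list].tolist())  # ['Adam', 'Adam', 'Misty']

# .apply() calls a function on each item
name_lengths = ds_team_series.apply(len)
print(name_lengths.tolist())  # [4, 4, 6, 6, 9, 5, 8]
```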

**Part 3**

- `pd.cut(series, bins=n_bins/array_edges)`: puts numerical values into discrete bins

In [28]:
# create bins of equal intervals
reaction_bins_series = pd.cut(sleep_reaction_time_series, 4)
reaction_bins_series

1       (194.06, 262.338]
2       (194.06, 262.338]
3       (194.06, 262.338]
4      (262.338, 330.343]
5      (330.343, 398.348]
              ...        
176    (262.338, 330.343]
177    (330.343, 398.348]
178    (330.343, 398.348]
179    (330.343, 398.348]
180    (330.343, 398.348]
Name: Reaction, Length: 180, dtype: category
Categories (4, interval[float64, right]): [(194.06, 262.338] < (262.338, 330.343] < (330.343, 398.348] < (398.348, 466.354]]

In [29]:
reaction_bins_series.value_counts()

(262.338, 330.343]    75
(194.06, 262.338]     53
(330.343, 398.348]    44
(398.348, 466.354]     8
Name: Reaction, dtype: int64

Can also bin with `.value_counts(bins=n)`

In [30]:
sleep_reaction_time_series.value_counts(bins = 5)

(248.736, 303.141]    70
(303.141, 357.545]    48
(194.059, 248.736]    35
(357.545, 411.949]    20
(411.949, 466.354]     7
Name: Reaction, dtype: int64
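
`pd.cut` also accepts an array of bin edges instead of a bin count. The sketch below uses made-up reaction times rather than the `sleepstudy` data, and the edges and labels are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up values, not taken from the sleepstudy dataset
reaction_times = pd.Series([200.0, 255.0, 290.0, 340.0, 460.0])

# Explicit edges define the bins (150, 250], (250, 350], (350, 500]
edges = [150, 250, 350, 500]
labels = ['fast', 'medium', 'slow']  # one label per bin
binned = pd.cut(reaction_times, bins=edges, labels=labels)

print(binned.tolist())        # ['fast', 'medium', 'medium', 'medium', 'slow']
print(binned.value_counts())  # counts per labeled bin
```

Explicit edges are useful when the bins should be meaningful thresholds rather than equal-width intervals.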

- `.plot()` after a series to quickly plot and visualize.
    - `.plot.hist()`
        - `bins`: The number of bins to use in the histogram.
        - `range`: range of values to use in histogram.
        - `density`: Whether to normalize the histogram such that the area under the histogram sums to 1.
        - `cumulative`: Whether to compute a cumulative histogram.
        - `histtype`: histogram type. 'bar', 'barstacked', 'step', 'stepfilled'.
        - `align`: bar alignment. 'left', 'mid', 'right'.
        - `orientation`: histogram orientation. 'horizontal', 'vertical'.
        - `color`: bar colors.
        - `alpha`: bar transparency.
        - `label`: label for the histogram in a legend.
        - `stacked`: Whether to stack multiple histograms on top of each other.
    - `.plot.bar()`
        - parameters: `color`: bar color, `edgecolor`: bar edge color, `linewidth`: bar edge width, `width`: bar width, `alpha`: bar transparency, `grid`: grid display, `xlabel`, `ylabel`: x/y-axis labels, `title`: chart title
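
A minimal sketch of `.plot.hist()` with a few of the parameters above. It assumes matplotlib is installed (pandas plotting depends on it) and uses randomly generated sample data. The `Agg` backend is selected only so the sketch runs without a display; in a notebook the plot renders inline.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd

# Randomly generated sample data for illustration
rng = np.random.default_rng(0)
samples = pd.Series(rng.normal(loc=100, scale=20, size=200))

# bins sets the number of bars, density normalizes the area to 1,
# alpha and color control the appearance
ax = samples.plot.hist(bins=10, density=True, alpha=0.6, color='steelblue')
ax.set_title('Histogram of samples')
ax.set_xlabel('value')
```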

##### DataFrames

In [31]:
np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores (arrays) for each student in each subject
# randint's high is exclusive, so scores fall in the 60-99 range
# the arrays need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

type(df)
df

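| | name | math | english | reading |
|---|---|---|---|---|
| 0 | Sally | 62 | 85 | 80 |
| 1 | Jane | 88 | 79 | 67 |
| 2 | Suzie | 94 | 74 | 95 |
| 3 | Billy | 98 | 96 | 88 |
| 4 | Ada | 77 | 92 | 98 |
| 5 | John | 79 | 76 | 93 |
| 6 | Thomas | 82 | 64 | 81 |
| 7 | Marie | 93 | 63 | 90 |
| 8 | Albert | 92 | 62 | 87 |
| 9 | Richard | 69 | 80 | 94 |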


Summarizing DFs:
- `.info()`
- `.describe()`
- `.dtypes`
- `.shape`
- `.columns`
    - `df.columns = [col.upper() for col in df.columns]` example column name transformation
- `.index`

Selecting columns:
- `df[['col1','col2']]` to access multiple columns as a dataframe
- `df.col1` to access single column as a series
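
The summary attributes and column access patterns above can be sketched on a three-row subset of the grades data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Sally', 'Jane', 'Suzie'],
                   'math': [62, 88, 94],
                   'english': [85, 79, 74]})

print(df.shape)          # (3, 3)
print(list(df.columns))  # ['name', 'math', 'english']
df.info()                # prints column names, non-null counts, and dtypes
print(df.describe())     # summary statistics for the numeric columns

# Double brackets return a DataFrame, attribute access returns a Series
subset = df[['name', 'math']]
math_series = df.math
print(type(subset).__name__, type(math_series).__name__)  # DataFrame Series

# Column names can be replaced by assigning to .columns
df.columns = [col.upper() for col in df.columns]
print(list(df.columns))  # ['NAME', 'MATH', 'ENGLISH']
```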




In [32]:
df[df.math < 80]

| | name | math | english | reading |
|---|---|---|---|---|
| 0 | Sally | 62 | 85 | 80 |
| 4 | Ada | 77 | 92 | 98 |
| 5 | John | 79 | 76 | 93 |
| 9 | Richard | 69 | 80 | 94 |


Drop and rename columns

In [33]:
df.drop(columns=['english', 'reading']) # returns a new DataFrame; df itself is not updated

| | name | math |
|---|---|---|
| 0 | Sally | 62 |
| 1 | Jane | 88 |
| 2 | Suzie | 94 |
| 3 | Billy | 98 |
| 4 | Ada | 77 |
| 5 | John | 79 |
| 6 | Thomas | 82 |
| 7 | Marie | 93 |
| 8 | Albert | 92 |
| 9 | Richard | 69 |


In [34]:
df = df.rename(columns={'name': 'student'})
df

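| | student | math | english | reading |
|---|---|---|---|---|
| 0 | Sally | 62 | 85 | 80 |
| 1 | Jane | 88 | 79 | 67 |
| 2 | Suzie | 94 | 74 | 95 |
| 3 | Billy | 98 | 96 | 88 |
| 4 | Ada | 77 | 92 | 98 |
| 5 | John | 79 | 76 | 93 |
| 6 | Thomas | 82 | 64 | 81 |
| 7 | Marie | 93 | 63 | 90 |
| 8 | Albert | 92 | 62 | 87 |
| 9 | Richard | 69 | 80 | 94 |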


Create columns

In [37]:
df['passing_math'] = df.math >= 70 # adds the column to df in place
df = df.assign(passing_english=df.english >= 70)
df

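| | student | math | english | reading | passing_math | passing_english |
|---|---|---|---|---|---|---|
| 0 | Sally | 62 | 85 | 80 | False | True |
| 1 | Jane | 88 | 79 | 67 | True | True |
| 2 | Suzie | 94 | 74 | 95 | True | True |
| 3 | Billy | 98 | 96 | 88 | True | True |
| 4 | Ada | 77 | 92 | 98 | True | True |
| 5 | John | 79 | 76 | 93 | True | True |
| 6 | Thomas | 82 | 64 | 81 | True | False |
| 7 | Marie | 93 | 63 | 90 | True | False |
| 8 | Albert | 92 | 62 | 87 | True | False |
| 9 | Richard | 69 | 80 | 94 | False | True |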


Sort by values

In [38]:
df.sort_values(by='english', ascending=False)

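| | student | math | english | reading | passing_math | passing_english |
|---|---|---|---|---|---|---|
| 10 | Isaac | 92 | 99 | 93 | True | True |
| 3 | Billy | 98 | 96 | 88 | True | True |
| 4 | Ada | 77 | 92 | 98 | True | True |
| 0 | Sally | 62 | 85 | 80 | False | True |
| 9 | Richard | 69 | 80 | 94 | False | True |
| 1 | Jane | 88 | 79 | 67 | True | True |
| 5 | John | 79 | 76 | 93 | True | True |
| 2 | Suzie | 94 | 74 | 95 | True | True |
| 6 | Thomas | 82 | 64 | 81 | True | False |
| 7 | Marie | 93 | 63 | 90 | True | False |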


### [matplotlib](https://matplotlib.org/stable/index.html)



### [seaborn](https://seaborn.pydata.org/api.html)