# Introduction

This file is a collection of Python functions that I have been learning and using, primarily related to machine learning and artificial intelligence.

The file is indexed below by subsection, and each function has a definition, an example, and a link for further information.

<h4><a href="#pandas">1.0 Pandas (Data Tables)</a></h4>
<a href="#agg">agg()</a><br>
<a href="#drop">drop()</a><br>
<a href="#groupby">groupby()</a><br>
<a href="#loc">loc[]</a><br>
<a href="#nunique">nunique()</a><br>
<a href="#read_csv">read_csv()</a><br>
<a href="#to_csv">to_csv()</a>
<h4><a href="#sklearn">2.0 SKLearn</a></h4>
<a href="#labelencoder">LabelEncoder()</a><br>
<a href="#onehotencoder">OneHotEncoder()</a><br>
<a href="#pipeline">Pipeline()</a><br>
<a href="#simpleimputer">SimpleImputer()</a><br>
<a href="#traintestsplit">train_test_split()</a><br>
<h4><a href="#general">3.0 General</a></h4>
<a href="#intersection">intersection()</a><br>
<a href="#issubset">issubset()</a><br>
<a href="#set">set()</a><br>
<a href="#zip">zip()</a><br>


##### Notes:
- pandas is a library used to analyze data. It can be imported into any file using `import pandas as pd` *(the alias `pd` is optional and is simply a conventional shorthand)*

<h2 id="pandas">Pandas (Data Tables)</h2>

In [3]:
import pandas as pd

### <p id="agg">`agg()`</p>
**Definition**: aggregate using one or more operations over the specified axis. Can be used on both DataFrames and Series.
Common parameters:

    func: function, str, list or dict used to aggregate the data
    args: positional arguments to pass to func
    axis: {0 or 'index', 1 or 'columns'}, default 0

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html?highlight=agg#pandas.DataFrame.agg">Documentation</a>

In [4]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])
df.agg(['sum', 'min'])            # sum and min of every column
df.agg("mean", axis="columns")    # mean of each row (the output shown below)

0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
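
`agg()` also accepts a dictionary that maps column names to operations, which helps when each column needs a different summary. A minimal sketch, assuming the same small table as above (the name `summary` is only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])

# Apply a different aggregation to each column; NaN values are skipped by default
summary = df.agg({'A': 'sum', 'B': 'min', 'C': 'max'})
print(summary)
```

The result is a Series indexed by column name, with one aggregated value per column.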

### <p id="drop">`drop()`</p>
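**Definition**: remove the specified rows or columns from a table by label
Common options:

    labels: single label or list-like of index or column labels to drop
    axis: {0 or 'index', 1 or 'columns'}, default 0
    columns: single label or list-like, alternative to specifying axis=1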

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html?highlight=drop">Documentation</a>

In [5]:
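# `data` is the DataFrame loaded in the read_csv() example.
# axis=1 drops a column; the default axis=0 would look for a row labelled 'Col1'.
#data.drop(['Col1'], axis=1)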
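
Since the cell above is commented out, here is a small runnable sketch of both ways to use `drop()`. The tiny `data` table is an assumption that mirrors the CSV example later in this file:

```python
import pandas as pd

data = pd.DataFrame({'Col1': ['First'], 'Col2': ['Second']})

# Drop a column by label; drop() returns a new DataFrame and leaves `data` unchanged
without_col1 = data.drop(columns=['Col1'])

# Drop a row by its index label (axis=0 is the default)
without_row0 = data.drop([0])

print(list(without_col1.columns))
print(len(without_row0))
```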

### <p id="groupby">`groupby()`</p>
**Definition**: group a DataFrame using a mapper or by a Series of columns
Common options:

    by: mapping, function, label, or list of labels used to determine the groups
    axis: {0 or 'index', 1 or 'columns'}, default 0
    level: int, level name, or sequence of such, default None

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas-dataframe-groupby">Documentation</a>

In [6]:
import pandas as pd
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
                  index=index)

#df.groupby(level=0).mean()
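
The cell above groups by an index level. Grouping by an ordinary column is probably the more common case, so here is a minimal sketch of that, assuming the same animals stored as a regular column instead of a MultiIndex:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [390., 350., 30., 20.]})

# Group the rows by the 'Animal' column, then average 'Max Speed' within each group
mean_speed = df.groupby('Animal')['Max Speed'].mean()
print(mean_speed)
```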

### <p id="loc">`loc[]`</p>

**Definition:** access a group of rows and columns by label(s) or a boolean array. `loc` is an indexer, so it is used with square brackets rather than called like a function.
    
<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html">Documentation</a>

In [7]:
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
     index=['cobra', 'viper', 'sidewinder'],
     columns=['max_speed', 'shield'])
df.loc[['viper', 'sidewinder']]
#df.loc[:,'shield']

            max_speed  shield
viper               4       5
sidewinder          7       8
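
The definition also mentions boolean arrays. A short sketch of that form, plus selecting a single cell by row and column label, using the same table as above (the names `fast` and `value` are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# Boolean mask: keep only the rows where max_speed is greater than 3
fast = df.loc[df['max_speed'] > 3]

# Row label and column label together select a single value
value = df.loc['viper', 'shield']

print(fast)
print(value)
```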


### <p id="nunique">`nunique()`</p>

**Definition:** return the number of distinct elements along the specified axis
Common options:

    axis: {0 or 'index', 1 or 'columns'}, default 0
    dropna: bool, default True (do not include NaN in the counts)
    
<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html#pandas.DataFrame.nunique">Documentation</a>
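
A minimal sketch of `nunique()` on a small made-up table, counting distinct values per column and then per row:

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})

# Distinct values in each column (axis=0 is the default)
per_column = df.nunique()

# Distinct values in each row
per_row = df.nunique(axis=1)

print(per_column)
print(per_row)
```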

### <p id="read_csv">`read_csv()`</p>

**Definition:** read a comma-separated values (CSV) file at the given filepath into a DataFrame
Common options:

    sep: str, default ','

For more information and available options: <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">Documentation</a>

In [8]:
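import csv
# Create a small CSV file to read back; newline='' is recommended by the csv module docs
with open('example.csv', 'w', newline='') as file:
    filewriter = csv.writer(file)
    filewriter.writerow(['Col1', 'Col2'])      # header row
    filewriter.writerow(['First', 'Second'])   # one data row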

data = pd.read_csv('./example.csv')
#data.head()
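
The `sep` option matters when a file uses a delimiter other than a comma. A small sketch, assuming semicolon-separated text held in memory with `io.StringIO` so that no file is needed:

```python
import io
import pandas as pd

# Semicolon-separated text standing in for a file on disk
text = "Col1;Col2\nFirst;Second\n"

# read_csv accepts file-like objects as well as paths
data = pd.read_csv(io.StringIO(text), sep=';')
print(data)
```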

### <p id="to_csv">`to_csv()`</p>

**Definition:** write a DataFrame to a comma-separated values (CSV) file
Common options:

    index: bool, default True (write the row labels)

For more information and available options: <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv">Documentation</a>

In [9]:
import pandas as pd
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
#animals.to_csv('cows_and_goats.csv')
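
To see what `index` changes, the sketch below calls `to_csv()` without a path, in which case pandas returns the CSV text as a string instead of writing a file:

```python
import pandas as pd

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

# With no path argument, to_csv() returns the CSV content as a string
with_index = animals.to_csv()
without_index = animals.to_csv(index=False)

print(with_index)
print(without_index)
```

With `index=False` the row labels ('Year 1', 'Year 2') are left out of the output.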

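In [10]:
# nunique() example: count the distinct values in each column of `data` (loaded in the read_csv() cell)
#data.nunique()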

<h2 id="sklearn">SKLearn</h2>

### <p id="labelencoder">`LabelEncoder()`</p>

**Definition:** Encode target labels with values between 0 and n_classes-1. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
        
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">Documentation</a>

In [11]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
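le.fit([1, 2, 2, 6])                  # learn the distinct classes: 1, 2, 6
le.classes_                           # array([1, 2, 6])
le.transform([1, 1, 2, 6])            # array([0, 0, 1, 2])
le.inverse_transform([0, 0, 1, 2])    # back to the original labels (shown below)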

array([1, 1, 2, 6])
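
The definition mentions non-numerical labels, so here is a short sketch with strings. The city names are just sample data; classes are numbered in sorted order:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['paris', 'paris', 'tokyo', 'amsterdam'])

# Classes are stored in sorted order, so amsterdam=0, paris=1, tokyo=2
classes = list(le.classes_)
codes = le.transform(['tokyo', 'tokyo', 'paris'])
back = le.inverse_transform(codes)

print(classes)
print(codes)
print(back)
```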

### <p id="onehotencoder">`OneHotEncoder()`</p>

**Definition:** Encode categorical features as a one-hot numeric array
Common Parameters:

    handle_unknown: {'error', 'ignore'}, default='error'
        
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">Documentation</a>

In [12]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
enc.categories_
enc.transform([['Female', 1], ['Male', 4]]).toarray()
enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
enc.get_feature_names_out(['gender', 'group'])

array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
      dtype=object)
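
The cell above shows only its last result. The sketch below isolates the `transform` step to show what `handle_unknown='ignore'` does: the group value 4 was never seen during `fit`, so its group columns are all zeros instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

# Columns are: gender_Female, gender_Male, group_1, group_2, group_3
encoded = enc.transform([['Female', 1], ['Male', 4]]).toarray()
print(encoded)
```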

### <p id="pipeline">`Pipeline()`</p>

**Definition:** Apply a list of transforms and a final estimator
Common Parameters:

    steps: list
    memory: str or object with the joblib.Memory interface, default=None
    
Some Methods:

    fit(self, X[, y]): fit the model
    fit_predict(self, X[, y]): apply fit_predict of the last step in the pipeline after the transforms
    fit_transform(self, X[, y]): fit the model and transform with the final estimator
    get_params: get parameters for this estimator
    set_params: set parameters for this estimator
    score(self, X[, y, sample_weight]): apply transforms and score with the final estimator
    
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">Documentation</a>

In [13]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)

pipe.score(X_test, y_test)

0.88
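
Parameters of individual steps can be reached through `set_params` using the `<step name>__<parameter>` naming pattern. A minimal sketch on the same pipeline; the value `C=10` is an arbitrary choice for illustration, not a recommendation:

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# 'svc__C' means: parameter C of the step named 'svc'
pipe.set_params(svc__C=10)
pipe.fit(X_train, y_train)

# Individual steps are available by name after fitting
print(pipe.named_steps['svc'].C)
print(pipe.score(X_test, y_test))
```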

### <p id="simpleimputer">`SimpleImputer()`</p>

**Definition:** Imputation transformer for completing missing values   
Common Parameters:

    missing_values: number, string, np.nan (default) or None
    strategy: string, default='mean'
    
Methods:

    fit(self, X[, y]): fit the imputer on X
    fit_transform(self, X[, y]): fit to data, then transform it
    get_params: get parameters for this estimator
    set_params: set parameters for this estimator
    transform(self, X): impute all missing values in X
    
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">Documentation</a>

In [14]:
import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
#print(imp_mean.transform(X))
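
The `strategy` parameter also accepts other values such as `'median'`. A small sketch with made-up numbers, using `fit_transform` to fit and fill in a single step:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = [[1, 2], [np.nan, 4], [7, np.nan], [3, 6]]

# Replace each missing value with the median of its column
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
filled = imp_median.fit_transform(X)
print(filled)
```

Here the first column's observed values are 1, 7 and 3 (median 3), and the second column's are 2, 4 and 6 (median 4).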

### <p id="traintestsplit">`train_test_split()`</p>

**Definition:** split arrays or matrices into random train and test subsets
Required:
    
    arrays: sequence of indexables with same length/shape[0]
    
Common options:

    test_size: float or int, default=None
    random_state: int or RandomState instance, default=None
    
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">Documentation</a>

In [15]:
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
#print(X)
#print(y)
#print(X_train)
#print(y_train)
#print(X_test)
#print(y_test)
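
To make the split easy to check by eye, the sketch below turns shuffling off with `shuffle=False` and passes `test_size` as an absolute number of samples. Both choices are for illustration only; shuffling is normally left on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = list(range(10))

# An integer test_size is the number of test samples; shuffle=False keeps the original order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3, shuffle=False)

print(X_train.shape, X_test.shape)
print(y_train)
print(y_test)
```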

<h2 id="general">General</h2>

### <p id="issubset">`issubset()`</p>

**Definition:** Returns True if all items in the set exist in the specified set; otherwise it returns False.

In [22]:
set1 = {2, 3, 4, 5, 6, 7, 8, 9}
set2 = {4, 9, 16, 25, 36, 49, 64, 81}
print(set1.issubset(set2))

False
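
For contrast, a short sketch of a case that returns True, along with the equivalent `<=` operator. The set names are illustrative:

```python
small = {4, 9}
squares = {4, 9, 16, 25}

is_sub = small.issubset(squares)         # every item of small is in squares
is_sub_operator = small <= squares       # operator form of the same check
reverse = squares.issubset(small)        # squares has items that small lacks
from_list = small.issubset([4, 9, 10])   # the argument can be any iterable

print(is_sub, is_sub_operator, reverse, from_list)
```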


### <p id="intersection">`intersection()`</p>

**Definition:** Returns a new set containing only the items that exist in both sets.

In [21]:
set1 = {2, 3, 4, 5, 6, 7, 8, 9}
set2 = {4, 9, 16, 25, 36, 49, 64, 81}
print(set1.intersection(set2))

{9, 4}
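
The same result can be written with the `&` operator, and `intersection()` accepts more than one argument. A brief sketch (the third set, `evens`, is added here only for illustration):

```python
set1 = {2, 3, 4, 5, 6, 7, 8, 9}
set2 = {4, 9, 16, 25, 36, 49, 64, 81}
evens = {2, 4, 6, 8}

# Operator form of intersection
common = set1 & set2

# Items present in all three sets
all_three = set1.intersection(set2, evens)

print(common)
print(all_three)
```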


### <p id="set">`set()`</p>

**Definition:** A set is a collection which is unordered, unindexed, and contains no duplicate items. In Python, sets are written with curly brackets, and `set()` builds a set from any iterable.

In [17]:
list1 = [2, 3, 4, 5, 6, 7, 8]
list2 = [4, 9, 16, 25, 36, 49, 64]
pairs = zip(list1, list2)    # zip object yielding (x, y) pairs
print(set(pairs))            # the printed order of a set is arbitrary

{(6, 36), (8, 64), (4, 16), (5, 25), (3, 9), (7, 49), (2, 4)}
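
A common use of `set()` is removing duplicates from a list. A minimal sketch with made-up values, which also shows that an empty set has to be created with `set()` because `{}` creates a dictionary:

```python
readings = [3, 1, 3, 2, 1]

# Duplicates disappear when the list is converted to a set
unique = set(readings)

# Sets are unordered, so sort when a stable order is needed
ordered = sorted(unique)

empty_set = set()     # an empty set
empty_dict = {}       # note: this is a dict, not a set

print(ordered)
print(type(empty_set).__name__, type(empty_dict).__name__)
```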


### <p id="zip">`zip()`</p>

**Definition:** returns a zip object, which is an iterator of tuples in which the first items of the passed iterables are paired together, then the second items are paired together, and so on.

If the passed iterables have different lengths, the one with the fewest items decides the length of the result.

In [18]:
list1 = [2, 3, 4, 5, 6, 7, 8]
list2 = [4, 9, 16, 25, 36, 49, 64]
pairs = zip(list1, list2)    # zip object yielding (x, y) pairs
print(tuple(pairs))

((2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64))
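
The length rule above can be checked with two lists of different sizes. The sketch also shows `zip()` feeding `dict()`, which is a common way to build a lookup table; the names are illustrative:

```python
names = ['a', 'b', 'c']
values = [1, 2]

# 'c' has no partner, so it is dropped
pairs = list(zip(names, values))

# Pair keys with values to build a dictionary
lookup = dict(zip(names, values))

print(pairs)
print(lookup)
```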
