<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Python-Functions-&amp;-Objects-for-Data-Science" data-toc-modified-id="Python-Functions-&amp;-Objects-for-Data-Science-1">Python Functions &amp; Objects for Data Science</a></span><ul class="toc-item"><li><span><a href="#Purpose" data-toc-modified-id="Purpose-1.1">Purpose</a></span></li><li><span><a href="#Count-Vectorizer" data-toc-modified-id="Count-Vectorizer-1.2">Count Vectorizer</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Parameters" data-toc-modified-id="Parameters-1.2.0.1">Parameters</a></span></li><li><span><a href="#Methods" data-toc-modified-id="Methods-1.2.0.2">Methods</a></span></li></ul></li><li><span><a href="#Lambda-Functions" data-toc-modified-id="Lambda-Functions-1.2.1">Lambda Functions</a></span></li><li><span><a href="#Label-Encoder" data-toc-modified-id="Label-Encoder-1.2.2">Label Encoder</a></span></li></ul></li></ul></li></ul></div>

# Python Functions & Objects for Data Science

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Purpose
This post is intended to be an ever-growing list of functions and objects that are handy for any **Data Scientist** to know. Use the table of contents to navigate to a particular function. 

## Count Vectorizer
***
**Library:** sklearn  
**Official documentation:** https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html  
**Description:** `CountVectorizer` is a class intended for text data: it converts documents such as sentences into numeric features for machine learning algorithms. In essence, it counts how often each word occurs in each data point (row) and adds one feature (column) per word.   
**Example:**  

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
data = ["This is an example sentence.",
        "This is another example sentence",
        "The milky way is 105,700 light years wide",
        "Data Science is an amalgamation of computer science and statistics"]

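# Create a CountVectorizer object; the text itself is passed later to fit / fit_transform
cv = CountVectorizer(input = 'content', lowercase = True,  stop_words = 'english', ngram_range = (1,1), analyzer = 'word')

#### Parameters
There are 5 parameters that we used above to initialize our CountVectorizer object:  
***
**input:** tells the vectorizer what kind of items it will receive: 'content' (the default) for raw strings, or 'filename' / 'file' for paths or file objects to read. The data itself is not passed here but to `fit` or `fit_transform`  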

**lowercase:** ensures that each word in our data is first converted to lowercase before it is processed  

**stop_words:** ensures that generic words such as "the", "and", "at", etc. are ignored  

**ngram_range(a,b):** sets the smallest (a) and largest (b) number of consecutive words or characters that are combined into a single feature. If you set a = 1 and b = 2, then single words and pairs of adjacent words will be used to form the features. *There are examples for both below.*  

**analyzer:** this parameter can take the values 'word', 'char' or 'char_wb'. It tells the tokenizer whether a feature is built from words or from characters, and the ngram_range is then applied to words or characters accordingly. The difference between 'char' and 'char_wb' is how the tokenizer treats word boundaries. For instance, with ngram_range = (2,2), the word 'stop' would be tokenized as 'st', 'to', 'op' when the analyzer is set to 'char'. However, if set to 'char_wb' it would be tokenized as ' s', 'st', 'to', 'op', 'p '. In essence, the characters at the word boundaries are padded with white space.    
***
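The difference between 'char' and 'char_wb' described above can be checked directly. The following is a minimal sketch that uses `build_analyzer()`, which returns the callable the vectorizer applies to each string; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that turns one string into n-grams
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(2, 2)).build_analyzer()
char_wb_analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2)).build_analyzer()

char_bigrams = char_analyzer("stop")        # plain character pairs
char_wb_bigrams = char_wb_analyzer("stop")  # pairs from the space-padded word

print(char_bigrams)     # ['st', 'to', 'op']
print(char_wb_bigrams)  # [' s', 'st', 'to', 'op', 'p ']
```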

#### Methods
Once initialized, the CountVectorizer object has many methods that you can take advantage of; some important ones to keep in mind are:  
**fit -** learns the vocabulary (the set of features) from your data   
**fit_transform -** fits the count vectorizer object to your data and also returns the corresponding document-term matrix in sparse matrix form. Hence, to be able to visualize it, you will need to convert it to an array and then to a dataframe with appropriate column and index names.  

In [10]:
vectorized_data = cv.fit_transform(data).toarray()
pd.DataFrame(vectorized_data, columns=sorted(cv.vocabulary_), index=data)

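| | 105 | 700 | amalgamation | computer | data | example | light | milky | science | sentence | statistics | way | wide | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| This is an example sentence. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| This is another example sentence | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| The milky way is 105,700 light years wide | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| Data Science is an amalgamation of computer science and statistics | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 |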
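As a complement to `fit_transform`, here is a minimal sketch of calling `fit` and `transform` separately. It assumes a hypothetical split into training sentences and one new sentence (the `train` and `new` lists are made up for illustration) and shows that words unseen during `fit` do not become features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: two sentences to learn from, one new sentence to encode
train = ["This is an example sentence.", "This is another example sentence"]
new = ["An example about statistics"]

cv_fit = CountVectorizer(stop_words="english")
cv_fit.fit(train)  # learns the vocabulary only, returns the fitted object

vocab = sorted(cv_fit.vocabulary_)              # features learned from train
new_counts = cv_fit.transform(new).toarray()    # counts using that fixed vocabulary

print(vocab)       # ['example', 'sentence']
print(new_counts)  # [[1 0]] -> 'statistics' was never seen, so it is ignored
```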


Let's repeat the example, but this time set ngram_range to (1, 2). This will generate features from single words and from pairs of adjacent words.

In [11]:
cv = CountVectorizer(input = 'content', lowercase = True,  stop_words = 'english', ngram_range = (1,2), analyzer = 'word')
vectorized_data = cv.fit_transform(data).toarray()
pd.DataFrame(vectorized_data, columns=sorted(cv.vocabulary_), index=data)

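| | 105 | 105 700 | 700 | 700 light | amalgamation | amalgamation computer | computer | computer science | data | data science | ... | science | science amalgamation | science statistics | sentence | statistics | way | way 105 | wide | years | years wide |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| This is an example sentence. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| This is another example sentence | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| The milky way is 105,700 light years wide | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| Data Science is an amalgamation of computer science and statistics | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |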


In [12]:
cv.vocabulary_

{'example': 10,
 'sentence': 19,
 'example sentence': 11,
 'milky': 14,
 'way': 21,
 '105': 0,
 '700': 2,
 'light': 12,
 'years': 24,
 'wide': 23,
 'milky way': 15,
 'way 105': 22,
 '105 700': 1,
 '700 light': 3,
 'light years': 13,
 'years wide': 25,
 'data': 8,
 'science': 16,
 'amalgamation': 4,
 'computer': 6,
 'statistics': 20,
 'data science': 9,
 'science amalgamation': 17,
 'amalgamation computer': 5,
 'computer science': 7,
 'science statistics': 18}

***
***

## Lambda Functions
Lambda functions are short, anonymous, in-line functions that are used in place of defining and calling a regular function. They are usually used when the function you need would only be used once.
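As a minimal sketch of the idea, the snippet below compares a regular named function with equivalent lambdas used once as arguments to `map` and `sorted`. The sample sentences and the name `word_count` are illustrative only.

```python
# Illustrative sample data
sentences = ["This is an example sentence.", "Data Science"]

# A regular function: defined once, then called by name
def word_count(text):
    return len(text.split())

# The same logic as a lambda, written in-line where it is needed
counts = list(map(lambda text: len(text.split()), sentences))

# A lambda as a one-off sort key: order sentences by number of words
by_length = sorted(sentences, key=lambda text: len(text.split()))

print(counts)     # [5, 2]
print(by_length)  # ['Data Science', 'This is an example sentence.']
```

If the same logic is needed in several places, a named function such as `word_count` is usually easier to read and test.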

## Label Encoder


## Single Imputation
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer

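from sklearn.impute import SimpleImputer

Older scikit-learn releases exposed this functionality as `sklearn.preprocessing.Imputer`, which has since been removed in favor of `SimpleImputer`. Reference for the old class:  
https://kite.com/python/docs/sklearn.preprocessing.Imputer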

## Multiple Imputation
https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer

## Cross validation
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
