###### Note when making models, we should always do some validation, be it cross validation or train-test split.  At times, the homework will ask you to build models without performing this task.  The goal of this is to get you used to using skleaern functionality, and not building robust models.

### Getting Data
* We will be working with 2 datasets, both of which are built into sklearn:
* Boston Housing
    * we will use this for prediction problems
    * grab the boston dataset using load_boston()
    * we will work with RAD, ZN, CRIM, AGE
    * thus, make a dataframe with these features
    * note, you can keep the target, y seperate
* Iris
    * we will use this for classification problems

In [15]:
from sklearn.datasets import load_boston, load_iris
import pandas as pd
import numpy as np

In [18]:
# load boston
boston = load_boston()
boston_x, boston_y = load_boston(return_X_y=True)
boston_df = pd.DataFrame(boston_x, columns = boston["feature_names"])[["RAD", "ZN", "CRIM", "AGE"]]
boston_df["y"] = boston_y

In [3]:
# load iris
iris = load_iris()
data = iris["data"]
labels = iris["target_names"]
feature_columns = iris["feature_names"]

iris_df = pd.DataFrame(data, columns = feature_columns)
iris_df["label"] = np.array([labels[x] for x in iris["target"]])

In [4]:
boston_df.head(1)

Unnamed: 0,RAD,ZN,CRIM,AGE,y
0,1.0,18.0,0.00632,65.2,24.0


In [5]:
iris_df.head(1)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,setosa


### Question 1
* make a class KNN that has methods to
    * init
        * should take k and distance measure
        * should accept 2 possible distance measures (euclidean and city block)
        * use error handling to check a proper distance measure is chosen and an odd k
        * in practice, we can pick even numbered Ks for nearest neighbor, but stick with odd so our voting won't have ties.
    * generate a distance matrix
        * you can use cdist from scipy or sklearn pairwise distances
    * get nearest neighbors
        * give the user the option to pass in a distance matrix or data
        * you can return the index value of the nearest neighbors, similar to how argsort works.
        * your output should be a row for each observation and a column for each nearest neighbor
        * don't worry about ties
    * get prediction
        * should support regression and classification
        * if classification, do a majority vote
        * if regression, do a mean of the nearest neighbors
    * predict probaility
        * for classification only
            * do a distribution of the class labels of the nearest neighbors
            * return this as a dictionary like {observation:[.1,.5,.4]} where the first element of the list value is the prob of class 1
* remember, limit for loops, make use of the matrix functions, argmax and min, argsort etc.  
    * be aware of how numpy handles ties, see documenation below, but for our purporses you can forgo worrying about the ties but be aware of how the sorting works
* run each method in your class on the iris dataset, showing the first row of each returned matrix
* make sure you are returning values when making predictions, similar to sklearn functionality
* https://numpy.org/doc/1.18/reference/generated/numpy.sort.html

In [330]:
from scipy.spatial.distance import cdist
from collections import Counter

class KNN:
    
    @staticmethod
    def mode(x):
        '''
        returns the mode of x, used to apply along
        rows of a numpy dataframe
        '''
        return Counter(x).most_common()[0][0]
    
    @staticmethod
    def dist(x,l,k):
        '''
        x is the array, meant to be applied along rows of numpy array
        l is labels, for each label, add a dict element where value is an 
        empty list, used to predict_proba
        k is the number of Ks for our nearest neighbor
        '''
        d = dict((k,[]) for k in l)
        return [list(x).count(l)/k for l in d.keys()]
    
    def __init__(self, k, distance_measure):
        assert distance_measure in ["euclidean", "cityblock"], "pick euclideaen or cityblock"
        assert k%2 == 1, "pick an odd K"
        self.k = k
        self.distance_measure = distance_measure

    def generate_distance_matrix(self, data):
        '''
        generate and return a distance matrix
        '''
        return cdist(data, data, self.distance_measure)
    
    def get_nearest_neighbors(self, data, is_distance = True):
        '''
        if a distance matrix is passed in, we can get the nearest
        neighbor directly from it, otherwise we must first make the
        distance matrix
        '''
        if is_distance:
            self.nearest_neighbors = distance_matrix.argsort(axis = 1)[:,1:self.k+1]
        else:
            dm = cdist(data, data, self.distance_measure)
            self.nearest_neighbors = dm.argsort(axis = 1)[:,1:self.k+1]
        
    def predict_prob(self, labels):
        '''
        labels must be preserved because they are needed for
        the way the probabilies are returned
        '''
        self.labels = set(labels)
        matr = np.array(labels)[knn.nearest_neighbors]
        results = np.apply_along_axis(KNN.dist, 1, matr, self.labels, self.k)
        return dict(enumerate(results))
    
    def predict(self, **kwargs):
        '''
        we can handle prediction or classification, but we need labels or
        values passed in, default to prediction, which would be the else
        '''
        assert "labels" not in kwargs or "values" not in kwargs, "need to pass in values or labels"
        if "labels" in kwargs:
            matr = np.array(kwargs["labels"])[knn.nearest_neighbors]
            return np.apply_along_axis(KNN.mode, 1, matr)
        else:
            return np.mean(np.array(kwargs["values"])[knn.nearest_neighbors],1)

In [331]:
knn = KNN(3, "euclidean")

In [332]:
distance_matrix = knn.generate_distance_matrix(iris_df.drop("label", 1))
distance_matrix.shape

(150, 150)

In [333]:
knn.get_nearest_neighbors(distance_matrix, False)
knn.nearest_neighbors[0]

array([17, 28,  4])

In [334]:
predictions = knn.predict(labels = iris_df["label"])
predictions[0]

'setosa'

In [335]:
probs = knn.predict_prob(iris_df["label"])
probs[0]

array([1., 0., 0.])

In [336]:
knn.labels

{'setosa', 'versicolor', 'virginica'}

### Question 2
* make a generator that
    * takes each sentence and splits on the space (" ")
    * make sure to strip items (remove trailing and leading white space).  .strip, .lstrip and .rstrip can achieve this
    * get rid of punctuation
    * as you generate documents make 4 dictionaries
        * an index to word where the keys are an index value (starting at 0) and the values are words
        * word to index dict (converse of the above dict)
        * a document to list of tokens (index values in the list, not the word itself)
        * a count for each word where keys are words and values are counts
    * don't store punctuation or white space as keys in the dictionary
    * make sure to iterate the generator itself, not a list
    * note, you can use a double for loop her.  For documenet in corpus: for token in document
    * print out the dictionaries at the end

In [7]:
corpus = [
    "this one time at band camp...",
    "this one time i went to band camp",
    "one time while at basketball camp",
]

In [8]:
idx_to_word_dct = {}
docs_token_dct = {}
word_count_dct = {}
word_to_idx_dct = {}

In [9]:
corpus_gen = (x.lower().strip() for x in corpus)
print(type(corpus_gen))
corpus_gen = (x.replace(".","").replace(",","").replace("?","") for x in corpus_gen)
print(type(corpus_gen))
corpus_gen = (x.split(" ") for x in corpus_gen)
print(type(corpus_gen))

<class 'generator'>
<class 'generator'>
<class 'generator'>


In [10]:
idx_counter = 0

for doc_idx,doc in enumerate(corpus_gen):
    # for each document add to the dictionary
    docs_token_dct[doc_idx] = [] 
    
    for token in doc:
        
        if token not in word_to_idx_dct:
            idx_to_word_dct[idx_counter] = token
            word_to_idx_dct[token] = idx_counter
            idx_counter +=1            
        else:
            pass
        
        docs_token_dct[doc_idx].append(word_to_idx_dct[token])
            
    
        if token in word_count_dct:
            word_count_dct[token] += 1
        else:
            word_count_dct[token] = 1

In [203]:
idx_to_word_dct

{0: 'this',
 1: 'one',
 2: 'time',
 3: 'at',
 4: 'band',
 5: 'camp',
 6: 'i',
 7: 'went',
 8: 'to',
 9: 'while',
 10: 'basketball'}

In [204]:
corpus

['this one time at band camp...',
 'this one time i went to band camp',
 'one time while at basketball camp']

In [205]:
docs_token_dct

{0: [0, 1, 2, 3, 4, 5], 1: [0, 1, 2, 6, 7, 8, 4, 5], 2: [1, 2, 9, 3, 10, 5]}

In [11]:
word_count_dct

{'this': 2,
 'one': 3,
 'time': 3,
 'at': 2,
 'band': 2,
 'camp': 3,
 'i': 1,
 'went': 1,
 'to': 1,
 'while': 1,
 'basketball': 1}

In [164]:
# sample output
#
#
# corpus = [
#    "this is document 1",
#    "this is document 2"
#]

#idx_to_word = {
#   0:"this",
#   1:"is",
#   2: "document",
#   3: "1",
#   4: "2"   
#}

#docs = {
#    1:[0,1,2,3],
#    2:[0,1,2,4]
#}

#word_count = {
#   "this":2,
#   "is":2,
#   "document":2,
#   "1":1,
#   "2":1   
#}

### Question 3
* make your own classification grid search
* make a function that takes in x and y data, a list of models to test and number of folds of cross validation
    * optional models are decision tree, knn, logistic regression 
    * use error handling to make sure the list contains selections from the above 3
    * you can use set().issubset() to achieve this
* you can use default params for the models
* run cross validation for each model
* once the best model is found, train a final model on all the data (best model as defined by CV)
* this can be chosen using the best mean results from the CV score
* return the results of the CV in a dictionary where the key is the model name the values are lists of cv scores
* also return the best model, the actual model object not the name of the model 
* test on the iris dataset, using the 4 features and class labels
* you can use SKlearn classes here

In [31]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

In [42]:
def my_grid_search(x,y,test_models,cv):
    assert set(test_models).issubset(set(["dt", "knn", "lr"])), "models aren't supported"
    
    # testing models
    return_dct = {}
    for mod_param in test_models:
        if mod_param == "dt":
            mod = DecisionTreeClassifier()
        elif mod == "lr":
            mod = LogisticRegression()
        else:
            mod = KNeighborsClassifier()
        
        return_dct[mod_param] = cross_val_score(mod, x, y, cv=cv)
        
    # grabbing best model
    best_model = test_models[np.argmax([np.mean(v) for k,v in return_dct.items()])]    
    if best_model == "dt":
        final_model = DecisionTreeClassifier().fit(x,y)
    elif best_model == "lr":
        final_model = LogisticRegression().fit(x,y)
    else:
        final_model = KNeighborsClassifier().fit(x,y)
            
    return return_dct, final_model

In [43]:
results, best_model = my_grid_search(iris_df.drop("label", 1), iris_df["label"], ["dt", "lr", "knn"], 3)



In [44]:
results

{'dt': array([0.98039216, 0.92156863, 1.        ]),
 'lr': array([0.98039216, 0.98039216, 1.        ]),
 'knn': array([0.98039216, 0.98039216, 1.        ])}

In [45]:
best_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

### Question 4
* make a list of 1,000,000,000 random numbers(np.random.randint should be of help) with the lowest value being 2 and the highest being 5
* write a udf that will take a number and add 5 and return the nunmber
* use the multiprocessing library to map this udf to the list, running 4 workers
* note, due to the way jupyter notebooks locks the interpretor, this won't be runable in a notebook.  create a .py file in a text editor and run it using "python file.py"
* paste the code in here
* when you run the script, print the runtime to the console and take a screenshot of the console output and turn that in with your pdf notebook.

In [12]:
import multiprocessing
import time

def f(x):
    return x+5

if __name__ == "__main__":
    
    start = time.time()
    
    vect = list(range(10000000,2,5))
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(f, vect)
    
    print("Runtime:{}".format(time.time() - start))

Runtime:0.4063589572906494


### Question 5
* run a decision tree regression on the boston dataset
* create a training and test set
* build the model on the train
* and test it on the test set
* use "RAD", "ZN", "CRIM", "AGE" as your features
* print your training and test MSE and RMSE (note there eis no rmse in sklearn)
* you can use Sklearn classes and functions here
* round RMSE and MSE to 3 decimal places using round()

In [3]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In [4]:
boston_df.drop("y", 1).head(5)

Unnamed: 0,RAD,ZN,CRIM,AGE
0,1.0,18.0,0.00632,65.2
1,2.0,0.0,0.02731,78.9
2,2.0,0.0,0.02729,61.1
3,3.0,0.0,0.03237,45.8
4,3.0,0.0,0.06905,54.2


In [5]:
boston_df["y"].head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: y, dtype: float64

In [168]:
x_train, x_test, y_train, y_test = train_test_split(boston_df.drop("y", 1), 
                                                    boston_df["y"], test_size=0.80, random_state=42)

dt = DecisionTreeRegressor().fit(x_train, y_train)
yhat_train = dt.predict(x_train)
yhat_test = dt.predict(x_test)

train_mse = mean_squared_error(y_train, yhat_train)
test_mse = mean_squared_error(y_test, yhat_test)

train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print("Train MSE:{} : Train RMSE:{}".format(round(train_mse,3), round(train_rmse,3)))
print("Test MSE:{} : Test RMSE:{}".format(round(test_mse,3), round(test_rmse,3)))

Train MSE:0.0 : Train RMSE:0.0
Test MSE:113.694 : Test RMSE:10.663


### Question 6
* use the iris dataset to make a decision tree classifier (on a training set)
* use the trained model to make predictions for the test set
* for the test set, create a dataframe that has actual class, pred class and pred probability for each class
* create a confusion matrix and classification report for the model, using predicted and actual class values of the test set
* make sure the confusion matrix is in a dataframe where the columns and index values are the class labels
* you can use sklearn classes and functions here
* also print out a classification report
* model.classes_ will give you the ordering of the classes, since sklearn most of the time return index values of labeles and not actual labels

In [92]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [65]:
x_train, x_test, y_train, y_test = train_test_split(iris_df.drop("label", 1), 
                                                    iris_df["label"], test_size=0.33, random_state=42)

In [78]:
dt = DecisionTreeClassifier().fit(x_train, y_train)
yhat_test = dt.predict(x_test)
yhat_test_prob = dt.predict_proba(x_test)

In [85]:
prob_df = pd.DataFrame(yhat_test_prob, columns = ["prob_"+x for x in dt.classes_])
prob_df["yhat_test"] = list(yhat_test)
prob_df["y"] = list(y_test)

In [86]:
prob_df.shape

(50, 5)

In [87]:
prob_df.head(5)

Unnamed: 0,prob_setosa,prob_versicolor,prob_virginica,yhat_test,y
0,0.0,1.0,0.0,versicolor,versicolor
1,1.0,0.0,0.0,setosa,setosa
2,0.0,0.0,1.0,virginica,virginica
3,0.0,1.0,0.0,versicolor,versicolor
4,0.0,1.0,0.0,versicolor,versicolor


In [91]:
cm = pd.DataFrame(confusion_matrix(y_test, yhat_test), columns = dt.classes_, index = dt.classes_)
cm

Unnamed: 0,setosa,versicolor,virginica
setosa,19,0,0
versicolor,0,15,0
virginica,0,1,15


In [93]:
print(classification_report(y_test, yhat_test, labels=dt.classes_))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       0.94      1.00      0.97        15
   virginica       1.00      0.94      0.97        16

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50



### Question 7
* no need for a train and test on this one
* use the boston housing dataset to create a regression model
* use "RAD", "ZN", "CRIM", "AGE" as your features
* print out the betas
* now normalize the feaetures to 0-1 scale.  
* train another model using the normalized features
* print out the new betas
* make a dataframe that has the actual and predicted value for each observation for each model
* manually make columns that has the Error, Squared Error, Root Squared Error and Absolute Error for each observation for each model
* show the descriptive statistics for the dataframe
* you can use sklearn classes and functions, except to genereate the MSE and RMSE

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

In [19]:
x = boston_df.drop("y",1)
y = boston_df["y"]
lr = LinearRegression().fit(x,y)
for i in zip (x.columns, lr.coef_):
    print(i)

('RAD', -0.12271398121068501)
('ZN', 0.0815894304234554)
('CRIM', -0.24600144270790836)
('AGE', -0.040836222859710586)


In [20]:
scaler = MinMaxScaler().fit(x)
x_norm = scaler.transform(x)
lr_norm = LinearRegression().fit(x_norm,y)
for i in zip (x.columns, lr_norm.coef_):
    print(i)

('RAD', -2.822421567845762)
('ZN', 8.158943042345527)
('CRIM', -21.88671883754948)
('AGE', -3.96519723967789)


In [175]:
yhat = lr.predict(x)
yhat_norm = lr_norm.predict(x)

score_df = pd.DataFrame({
    "y":y,
    "yhat":yhat,
    "yhat_norm":yhat_norm
})

In [176]:
score_df["error"] = score_df["y"] - score_df["yhat"]
score_df["error_abs"] = abs(score_df["error"])
score_df["mse"] = (score_df["y"] - score_df["yhat"])**2
score_df["rmse"] = np.sqrt(score_df["mse"])

score_df["error_norm"] = score_df["y"] - score_df["yhat_norm"]
score_df["error_abs_norm"] = abs(score_df["error_norm"])
score_df["mse_norm"] = (score_df["y"] - score_df["yhat_norm"])**2
score_df["rmse_norm"] = np.sqrt(score_df["mse_norm"])

In [177]:
score_df.head(5)

Unnamed: 0,y,yhat,yhat_norm,error,error_abs,mse,rmse,error_norm,error_abs_norm,mse_norm,rmse_norm
0,24.0,25.148591,-88.406553,-1.148591,1.148591,1.319261,1.148591,112.406553,112.406553,12635.233208,112.406553
1,21.6,22.992647,-292.872554,-1.392647,1.392647,1.939466,1.392647,314.472554,314.472554,98892.987199,314.472554
2,34.7,23.719537,-222.291605,10.980463,10.980463,120.570571,10.980463,256.991605,256.991605,66044.685229,256.991605
3,33.4,24.220367,-164.557694,9.179633,9.179633,84.265654,9.179633,197.957694,197.957694,39187.248495,197.957694
4,36.2,23.86832,-198.668155,12.33168,12.33168,152.070336,12.33168,234.868155,234.868155,55163.050402,234.868155


In [178]:
score_df.describe()

Unnamed: 0,y,yhat,yhat_norm,error,error_abs,mse,rmse,error_norm,error_abs_norm,mse_norm,rmse_norm
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,22.532806,22.532806,-259.014296,-4.756845e-15,5.524858,62.861889,5.524858,281.547102,390.783068,224811.9,390.783068
std,9.197104,4.647618,386.39292,7.936395,5.692264,154.049017,5.692264,381.878734,268.78114,421444.3,268.78114
min,5.0,-2.119487,-2353.312739,-14.91661,0.010497,0.00011,0.010497,-704.451014,1.399489,1.958569,1.399489
25%,17.025,19.753366,-436.513706,-4.671013,2.021684,4.087365,2.021684,113.664829,217.341223,47246.66,217.341223
50%,21.2,22.636139,-298.36748,-1.937402,3.918629,15.355708,3.918629,322.543012,364.42999,132809.5,364.42999
75%,25.0,24.73208,-88.5089,2.11131,6.976617,48.673361,6.976617,466.388323,519.774617,270166.0,519.774617
max,50.0,33.277628,737.351014,32.83315,32.833146,1078.015445,32.833146,2363.712739,2363.712739,5587138.0,2363.712739


### Question 8
* create a dataframe where the columns are words (tokens) and the rows sentencese and the cells the count of eaech word
* create a second dataframe wheree the cells are 1s and 0s, 1 if the the word is present, 0 if it isn't

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
import pandas as pd
from itertools import chain

In [2]:
data = [
    "this this is my first name",
    "my first name is brian craft",
    "there there once once was a dog with the name richard",
    "once in a blue blue moon moon, a dog catches a fly",
    "my first mate went to Europe Europe"
]

In [41]:
# breakdown
# .replace(",","") get's ride of our commas
# .lower() remove the case sensativity in Europe Europe
# .split(" ") splits each string into a list of words
# we can string the above 3 together and end up with a list of words

# putting that list into Counter() and to get counts and immediatly make that a dictionary
# from_dict() will take a list of dicitionarys where each dictionary is a rows, each key is a column
    # and the value in the dict is the cell value
df_v1 = pd.DataFrame.from_dict([dict(Counter(x.replace(",","").lower().split(" "))) for x in data]).fillna(0)
df_v1

Unnamed: 0,this,is,my,first,name,brian,craft,there,once,was,...,richard,in,blue,moon,catches,fly,mate,went,to,europe
0,2.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,2.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,2.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0


In [55]:
lst = [dict(Counter(x.replace(",","").lower().split(" "))) for x in data]
lst = [list(zip([idx for _ in range(len(x.keys()))], list(x.keys()), x.values())) for idx,x in enumerate(lst)]
lst = list(chain.from_iterable(lst))
df_v2 = pd.DataFrame(lst, columns = ["doc", "token", "count"]).pivot(index = "doc", columns = "token", 
                                                                     values = "count")
df_v2.fillna(0)

token,a,blue,brian,catches,craft,dog,europe,first,fly,in,...,name,once,richard,the,there,this,to,was,went,with
doc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,2.0,1.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0
3,3.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [38]:
df_v1.applymap(lambda x: 1 if x > 0 else 0)

Unnamed: 0,this,is,my,first,name,brian,craft,there,once,was,...,richard,in,blue,moon,catches,fly,mate,went,to,europe
0,1,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,1,1,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,1,1,1,1,1,0,0,0,0
4,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,1,1


In [56]:
df_v2.applymap(lambda x: 1 if x > 0 else 0)

token,a,blue,brian,catches,craft,dog,europe,first,fly,in,...,name,once,richard,the,there,this,to,was,went,with
doc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,1,0,0,0,0
1,0,0,1,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,1,0,0,0,0,...,1,1,1,1,1,0,0,1,0,1
3,1,1,0,1,0,1,0,0,1,1,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,1,0,1,0


In [36]:
vectorizer = CountVectorizer().fit(data)
results = vectorizer.fit_transform(data)
df = pd.DataFrame(results.toarray(), columns = vectorizer.get_feature_names())
df

Unnamed: 0,blue,brian,catches,craft,dog,europe,first,fly,in,is,...,name,once,richard,the,there,this,to,was,went,with
0,0,0,0,0,0,0,1,0,0,1,...,1,0,0,0,0,2,0,0,0,0
1,0,1,0,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,1,2,1,1,2,0,0,1,0,1
3,2,0,1,0,1,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,2,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0


In [148]:
vectorizer = CountVectorizer(binary = True).fit(data)
results = vectorizer.fit_transform(data)
df = pd.DataFrame(results.toarray(), columns = vectorizer.get_feature_names())
df

Unnamed: 0,blue,brian,catches,craft,dog,europe,first,fly,in,is,...,name,once,richard,the,there,this,to,was,went,with
0,0,0,0,0,0,0,1,0,0,1,...,1,0,0,0,0,1,0,0,0,0
1,0,1,0,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,1,1,1,1,1,0,0,1,0,1
3,1,0,1,0,1,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0


### Quesetion 9
* What is something you could do with a dataframe like this?

* If we had class labels, we could perform any classification problem.
* We could cluster
* If, for instance, each document was an article, we could get the clicks for each article online and run a regression.