There is a lot of talk about sklearn's great documentation. There are 5 kinds of pages.  
* API - very high level overview of the function  
* Class Documentation  
* User guide - advice  
* Glossary: Every term that shows up in the documentation  

You might be more concerned with *percision* in a spam detector, more concerned about *recall* with a fraud detector.  

train_test_split is inferior to cross-validation. Use the cross-validation when you can.  

The reason that CountVectorizer requires things to be in a single dimension is that it supports operations that have two dimensions. An example is *multilable classification* where you can have documents that fall into several categories at once. For example a news article can at once be about science and medicine. 

Do the imports. Packages to import are 
> pandas  
> The package to make dummy variables  
> LogisticRegression  
> The package to transform the columns  
> The package to make a pipeline  

In [50]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

Make the list of columns we want in the model, the 'Parch' (the number of parents or siblings they had on the trip), the amount they paid to get on the boat, the port they got on the boat at, and thier gender. 

In [51]:
cols = ['Parch','Fare','Embarked','Sex']

Read the data into a Pandas DateFrame from the website abbreviated version of the kaggle website. Import the \
training set. Then assign the columns and the target variable to X and y, respectively. 

In [52]:
url = 'http://bit.ly/kaggletrain'
df = pd.read_csv(url, nrows=10)
X = df[cols]
y = df['Survived']

Download and assign the testing set to the X_new variable.

In [53]:
url = 'http://bit.ly/kaggletest'
df_new = pd.read_csv(url, nrows=10)
X = df_new[cols]

Instantiate the OneHotEncoder. There is an option in OneHotEncoder drop='first' to eliminate multicolinearity. Not really an issue in skearn models in the real world. It also limits you flexibility in handling unknown features. Also makes using standardization problematic. 

In [54]:
ohe = OneHotEncoder()

Construct and instatiate the column transformer. 

In [55]:
ct = make_column_transformer(
        (ohe, ['Embarked','Sex']),
        remainder='passthrough')

Run the column transformer to test.

In [56]:
ct.fit(X)

ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=True),
                                 ['Embarked', 'Sex'])],
                  verbose=False)

Create the logistic Regression model and assign a solver for small data sets. The solver is the algorythm used to solve the optimization problem. There are 5 different solvers in sklearn. There is a chart in the optimization. If you get a convergence warning is often a solver problem. Anytime there is a psuedo random state in solver (as there is in two of sklearn's logistic regression) specify a random state for reproducability. 

LogisticRegressionCV--doesn't integrate well with the rest of sklearn. 



In [57]:
logreg = LogisticRegression(solver='liblinear')

Make a pipeline and the fit the new model and make predictions.

In [58]:
pipe = make_pipeline(ct, logreg)

In [59]:
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categorical_features=None,
                                                                categories=None,
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                n_values=None,
                                                                sparse=True),
                                                  ['Embarked', 'Sex'])],
                      

Examine the coefficients.

In [60]:
pipe.named_steps.logisticregression.coef_

array([[ 0.38221491, -0.34343659,  0.09558606,  0.03207038,  0.10229399,
        -0.70656922, -0.00946779]])

Now use text data. First immport the text tranformer module, instantiate it, and vectorize the variable 'Name' into a document-test matrix. Remember that CountVectorizer expects one-dimensional input.

In [61]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit_transform(df['Name'])

<10x40 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>

The reason that CountVectorizer takes a one dimensional object is because sklearn learn supports 'multilable classification' of documents which requires a 2 dimensional y.

Examine the feature names.

In [62]:
print(vect.get_feature_names())

['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']


Put the data into a DataFrame. 

In [63]:
pd.DataFrame(data=vect.fit_transform(df['Name']).toarray(), columns=vect.get_feature_names())

Unnamed: 0,achem,adele,allen,berg,bradley,braund,briggs,cumings,elisabeth,florence,...,nasser,nicholas,oscar,owen,palsson,peel,thayer,timothy,vilhelmina,william
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,1,0,1,1,0,1,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,0,0,0,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
9,1,1,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0


Update X to include 'Name'

In [64]:
cols.append('Name')
X = df[cols]
X_new = df[cols]

Update the ColumnTransformer

In [65]:
ct = make_column_transformer(
        (ohe, ['Embarked','Sex']),
        (vect, 'Name'),
        remainder='passthrough')

Run the ColumnTransformer

In [66]:
ct.fit_transform(X).toarray()

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ,  0.    ,
         0.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  0.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    ,  0.    ,
         1.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  1.    ,  0.    ,  0.    ,  0.    ,  0.    ,
         0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ,
         0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ,
         0.    ,  0.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.  

Now, update the pipe.

In [67]:
pipe = make_pipeline(ct, logreg)

Fit the pipeline and examine the steps. 

In [68]:
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categorical_features=None,
                                                                categories=None,
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                n_values=None,
                                                                sparse=True),
                                                  ['Embarked', 'Sex']),
                       

Update the testing data frame to include the new column. 

In [70]:
pipe.predict(X_new)

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])

Make predictions.