# Organize ML projects with Scikit-Learn

Machine learning is powerful, but people often overestimate it, as if applying it to a project would solve every problem. In reality, it is not that simple: to be effective, the work needs to be well organized. In this notebook, we will walk through the practical aspects of an ML project. To see the big picture, let's start with the checklist below. It should work reasonably well for most ML projects, but make sure to adapt it to your needs:

1. **Define the scope of work and objective**
    * How will your solution be used?
    * How should performance be measured? Are there any constraints?
    * How would the problem be solved manually?
    * List the assumptions made so far, and verify them if possible.
    
    
2. **Get the data**
    * Document where you can get the data
    * Store data in a workspace you can easily access
    * Convert the data to a format you can easily manipulate
    * Check the overview (size, type, sample, description, statistics)
    * Data cleaning
    
    
3. **EDA & Data transformation**
    * Study each attribute and its characteristics (missing values, type of distribution, usefulness)
    * Visualize the data
    * Study the correlations between attributes
    * Feature selection, feature engineering, feature scaling
    * Write functions for all data transformations
    
    
4. **Train models**
    * Automate as much as possible
    * Train promising models quickly using standard parameters. Measure and compare their performance
    * Analyze the errors the models make
    * Shortlist the top three to five most promising models, preferring models that make different types of errors.


5. **Fine-tuning**
    * Treat data transformation choices as hyperparameters, especially when you are not sure about them (e.g., whether to replace missing values with zeros or with the median value)
    * Unless there are very few hyperparameter values to explore, prefer random search over grid search.
    * Try ensemble methods
    * Test your final model on the test set to estimate the generalization error. Don't tweak your model afterwards, or you will start overfitting the test set.
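
As a rough illustration of the fine-tuning advice above, the sketch below treats the imputation strategy as a hyperparameter and explores it together with the regularization strength using random search. The synthetic data, the step names `impute` and `clf`, and the search ranges are all assumptions made for this example; none of them come from the article dataset used later.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 200 rows, 5 numeric features, binary target
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.rand(*X.shape) < 0.1] = np.nan  # blank out roughly 10% of the values

pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The data transformation choice sits next to the model hyperparameter
param_distributions = {
    # "constant" fills numeric columns with 0 by default
    "impute__strategy": ["mean", "median", "constant"],
    "clf__C": loguniform(1e-2, 1e2),
}

search = RandomizedSearchCV(pipe, param_distributions,
                            n_iter=10, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Random search samples `n_iter` combinations instead of trying every one, so its cost stays fixed as more hyperparameters are added.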

## Example: Article categorization

### Objectives

Build a model that assigns each BBC news article to one of five categories: tech, business, sport, entertainment, or politics.

### Get Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")

In [3]:
bbc = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/bbc-text.csv')

In [4]:
bbc.sample(5)

      category                                               text
1829  business  wipro beats forecasts once again wipro india ...
1381      tech  millions to miss out on the net by 2025 40% o...
674       tech  the pirates with no profit motive two men who ...
1618  business  diageo to buy us wine firm diageo the world s...
1554      tech  gadget market to grow in 2005 the explosion ...


In [5]:
bbc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [6]:
# List the distinct categories
bbc.category.unique()

array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)

In [7]:
text = bbc['text']
text

0       tv future in the hands of viewers with home th...
1       worldcom boss  left books alone  former worldc...
2       tigers wary of farrell  gamble  leicester say ...
3       yeading face newcastle in fa cup premiership s...
4       ocean s twelve raids box office ocean s twelve...
                              ...                        
2220    cars pull down us retail figures us retail sal...
2221    kilroy unveils immigration policy ex-chatshow ...
2222    rem announce new glasgow concert us band rem h...
2223    how political squabbles snowball it s become c...
2224    souness delight at euro progress boss graeme s...
Name: text, Length: 2225, dtype: object

In [8]:
category = bbc.category

In [9]:
category

0                tech
1            business
2               sport
3               sport
4       entertainment
            ...      
2220         business
2221         politics
2222    entertainment
2223         politics
2224            sport
Name: category, Length: 2225, dtype: object

In [14]:
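# Pick a random row and print its text and label
# (named idx to avoid shadowing the standard-library random module)
idx = np.random.randint(category.shape[0])
print(text.loc[idx], category.loc[idx])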

o driscoll/gregan lead aid stars ireland s brian o driscoll will lead the northern hemisphere team in the irb rugby aid match at twickenham.  o driscoll heads a star-studded cast for the contest to raise funds for the tsunami appeal. the south will be led by george gregan  one of four wallabies  alongside five springboks and four all blacks including captain tana umaga. south african flanker schalk burger has shaken off a leg injury to take his place in the starting line-up. he will join fellow springboks john smit  cobus visagie and victor matfield in the south pack  with jacque fourie among the centres. the north side have been hit by the withdrawals of scotland duo gordon bulloch and chris cusiter  plus france captain fabien pelous.  but leicester s england centre ollie smith has been added to the squad  giving him an opportunity to impress lions coach sir clive woodward  who takes charge of the north side.  i think it s fantastic for ollie   tigers coach john wells told bbc radio l

In [15]:
import re 

def preprocessor(text):
    """Return a cleaned version of text."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lower-case the text, replace each run of non-word characters with a space,
    # and append the emoticons with the nose character removed for standardization
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' ' + ' '.join(emoticons).replace('-', ''))

    return text
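
To see what the cleaning step does, here is a small sketch that runs the same logic on a made-up string. The sample input is hypothetical, and the function is repeated only so the block runs on its own.

```python
import re

def preprocessor(text):
    """Same cleaning steps as above, repeated so this block is self-contained."""
    text = re.sub(r'<[^>]*>', '', text)                            # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # keep emoticons
    return (re.sub(r'[\W]+', ' ', text.lower())
            + ' ' + ' '.join(emoticons).replace('-', ''))

sample = "</a>This :) is :( a test :-)!"  # hypothetical input
cleaned = preprocessor(sample)
print(cleaned.split())
# ['this', 'is', 'a', 'test', ':)', ':(', ':)']
```

The markup and punctuation are gone, the words are lower-cased, and the emoticons survive as separate tokens at the end with the `-` nose removed.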


In [16]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Split a text into a list of words
def tokenizer(text):
    return text.split()

# Split a text into a list of words and stem each word
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
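
The two lists printed after this cell appear to come from running both tokenizers on a single sentence, but the call itself is missing from the notebook. The following is a plausible reconstruction, with the sentence inferred from the printed tokens; treat it as an assumption rather than the original cell.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

# Sentence inferred from the printed tokens (assumption)
sentence = "Hi there, I am loving this, like with a lot of love"

print(tokenizer(sentence))         # plain whitespace split
print(tokenizer_porter(sentence))  # same split, each word stemmed
```

The stemmer maps `loving` to `love`, so both forms end up as the same feature. Some NLTK versions also lower-case every token inside `stem`, so the casing of the stemmed list may differ from the output shown.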

['Hi', 'there,', 'I', 'am', 'loving', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']
['Hi', 'there,', 'I', 'am', 'love', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']


In [20]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [21]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text, category, test_size=0.2, random_state=42)
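
The split above shuffles the rows without looking at the labels. If the category proportions should be the same in both parts, `train_test_split` accepts a `stratify` argument. The toy documents and labels below are made up for illustration, and this is a variation on the notebook's split, not what it actually does.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 "sport" and 20 "tech" documents
texts = [f"doc {i}" for i in range(100)]
labels = ["sport"] * 80 + ["tech"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Both parts keep the 80/20 ratio of the full data
print(Counter(y_tr))
print(Counter(y_te))
```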

In [95]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier


In [41]:
bow = CountVectorizer(stop_words=stop_words,
                      tokenizer=tokenizer_porter,
                      preprocessor=preprocessor)

In [None]:
tfidf = TfidfVectorizer(stop_words=stop_words,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)
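
The two vectorizers differ in how they weight words: `CountVectorizer` stores raw term counts, while `TfidfVectorizer` down-weights terms that occur in many documents and, by default, scales each row to unit L2 norm. The sketch below uses a made-up three-sentence corpus and default settings (no custom tokenizer or stop words), so it illustrates the idea and does not reproduce the notebook's features.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "the match ended in a draw",
    "the market ended lower",
    "the band played the match",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)

# Raw counts of "the" in each document
the_col = count_vec.vocabulary_["the"]
the_counts = counts[:, the_col].toarray().ravel()
print("counts of 'the':", the_counts)

# In the first document, "the" and "draw" both occur once, but "the"
# appears in every document and therefore gets the smaller tf-idf weight
vocab = tfidf_vec.vocabulary_
first_doc = weights[0].toarray().ravel()
print("tf-idf 'the' :", first_doc[vocab["the"]])
print("tf-idf 'draw':", first_doc[vocab["draw"]])

# Each tf-idf row has unit L2 norm with the default norm='l2'
row_norms = np.asarray(np.sqrt(weights.multiply(weights).sum(axis=1))).ravel()
print("row norms:", row_norms)
```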

In [49]:
clf_logistic = Pipeline([('vect', bow),
                ('clf', LogisticRegression(random_state=0))])
clf_logistic.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1),
                                 preprocessor=<function preprocessor at 0x7fa46d44abf8>,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves'...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_porter at 0x7fa46d429f28>,
                                 vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
          

In [119]:
clf_logistic = Pipeline([('vect', tfidf),
                         ('clf', LogisticRegression(random_state=42, C=1))])
clf_logistic.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7fa46d44abf8>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_porter at 0x7fa46d429f28>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1, class_weight=None, dual=False,
           

In [120]:
y_train_pred = clf_logistic.predict(X_train)
y_test_pred = clf_logistic.predict(X_test)


In [26]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [121]:
print('accuracy:',accuracy_score(y_test,y_test_pred))
print('confusion matrix:\n',confusion_matrix(y_test,y_test_pred))
print('classification report:\n',classification_report(y_test,y_test_pred))

accuracy: 0.9685393258426966
confusion matrix:
 [[96  0  4  1  0]
 [ 2 78  0  0  1]
 [ 2  0 81  0  0]
 [ 0  0  0 98  0]
 [ 3  0  0  1 78]]
classification report:
                precision    recall  f1-score   support

     business       0.93      0.95      0.94       101
entertainment       1.00      0.96      0.98        81
     politics       0.95      0.98      0.96        83
        sport       0.98      1.00      0.99        98
         tech       0.99      0.95      0.97        82

     accuracy                           0.97       445
    macro avg       0.97      0.97      0.97       445
 weighted avg       0.97      0.97      0.97       445
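
When reading the matrix, note that `confusion_matrix` puts true labels on the rows and predicted labels on the columns, in sorted label order unless `labels` is passed. For instance, the first row above sums to 101 business articles, 96 of them on the diagonal, which matches the reported recall of 0.95. The toy labels below are invented to show how per-class recall and precision fall out of the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = ["business", "business", "sport", "tech", "tech", "tech"]
y_pred = ["business", "tech", "sport", "tech", "tech", "business"]
labels = ["business", "sport", "tech"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Recall: correct predictions divided by the row total (true count per class)
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the column total (predicted count per class)
precision = cm.diagonal() / cm.sum(axis=0)

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.2f} precision={p:.2f}")
```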



In [57]:
clf_nbc = Pipeline([('vect', bow), ('clf', MultinomialNB())])
clf_nbc.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1),
                                 preprocessor=<function preprocessor at 0x7fa46d44abf8>,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves'...
                                             "you're", "you've", "you'll",
                                             "you'd", 'your', 'yours',
                                             'yourself', 'yourselves', 'he',
                                             'him', 'his', 'himself', 'she',
                  

In [58]:
y_train_pred_nbc = clf_nbc.predict(X_train)
y_test_pred_nbc = clf_nbc.predict(X_test)

In [59]:
print(accuracy_score(y_test, y_test_pred_nbc))
print(confusion_matrix(y_test, y_test_pred_nbc))
print('classification report:\n',classification_report(y_test,y_test_pred_nbc))

0.9662921348314607
[[94  0  6  0  1]
 [ 1 74  1  0  5]
 [ 1  0 82  0  0]
 [ 0  0  0 98  0]
 [ 0  0  0  0 82]]
classification report:
                precision    recall  f1-score   support

     business       0.98      0.93      0.95       101
entertainment       1.00      0.91      0.95        81
     politics       0.92      0.99      0.95        83
        sport       1.00      1.00      1.00        98
         tech       0.93      1.00      0.96        82

     accuracy                           0.97       445
    macro avg       0.97      0.97      0.97       445
 weighted avg       0.97      0.97      0.97       445



In [92]:
clf_dt = Pipeline([('vect', tfidf),
                ('clf', DecisionTreeClassifier(min_samples_split=5, min_samples_leaf=5))])
clf_dt.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7fa46d44abf8>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features=None, max_leaf_nodes=None,
             

In [93]:
y_train_pred_dt = clf_dt.predict(X_train)
y_test_pred_dt = clf_dt.predict(X_test)

In [94]:
print(accuracy_score(y_train,y_train_pred_dt))
print(accuracy_score(y_test, y_test_pred_dt))
print(confusion_matrix(y_test, y_test_pred_dt))
print('classification report:\n',classification_report(y_test,y_test_pred_dt))

0.9342696629213483
0.8067415730337079
[[77  5 12  3  4]
 [ 9 60  2  6  4]
 [10  1 69  1  2]
 [ 3  2  6 86  1]
 [ 4  3  4  4 67]]
classification report:
                precision    recall  f1-score   support

     business       0.75      0.76      0.75       101
entertainment       0.85      0.74      0.79        81
     politics       0.74      0.83      0.78        83
        sport       0.86      0.88      0.87        98
         tech       0.86      0.82      0.84        82

     accuracy                           0.81       445
    macro avg       0.81      0.81      0.81       445
 weighted avg       0.81      0.81      0.81       445
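
The decision tree scores about 0.93 on the training set but only 0.81 on the test set, which suggests it is overfitting. When comparing candidate models like this, cross-validation on the training data gives a performance estimate without consulting the test set, in line with the checklist. The sketch below uses synthetic data; the dataset, the model settings, and the fold count are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a training set (assumption)
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(min_samples_leaf=5, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation: each fold is held out once for scoring
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The same call works with a text `Pipeline` as the estimator, in which case the vectorizer is refitted inside every fold.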



In [104]:
clf_voting = VotingClassifier(estimators=[('logistic', clf_logistic), ('nbc', clf_nbc), ('dt', clf_dt)])
clf_voting.fit(X_train, y_train)
y_train_pred_voting = clf_voting.predict(X_train)
y_test_pred_voting = clf_voting.predict(X_test)

In [105]:
print(accuracy_score(y_train,y_train_pred_voting))
print(accuracy_score(y_test, y_test_pred_voting))
print(confusion_matrix(y_test, y_test_pred_voting))
print('classification report:\n',classification_report(y_test,y_test_pred_voting))

0.9977528089887641
0.9640449438202248
[[94  0  6  0  1]
 [ 3 75  0  0  3]
 [ 1  1 81  0  0]
 [ 0  0  0 98  0]
 [ 0  1  0  0 81]]
classification report:
                precision    recall  f1-score   support

     business       0.96      0.93      0.94       101
entertainment       0.97      0.93      0.95        81
     politics       0.93      0.98      0.95        83
        sport       1.00      1.00      1.00        98
         tech       0.95      0.99      0.97        82

     accuracy                           0.96       445
    macro avg       0.96      0.96      0.96       445
 weighted avg       0.96      0.96      0.96       445
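
Here the ensemble (0.964) does not beat the logistic regression pipeline on its own (0.969). One thing that might be worth trying is the voting scheme: `VotingClassifier` defaults to `voting='hard'`, a majority vote over predicted labels, while `voting='soft'` averages the predicted class probabilities and requires every estimator to implement `predict_proba`. The sketch below compares both on synthetic data; the dataset and the three base models are assumptions, not the notebook's pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(min_samples_leaf=5, random_state=0)),
]

voting_scores = {}
for voting in ("hard", "soft"):
    # VotingClassifier fits clones, so the same estimator list can be reused
    clf = VotingClassifier(estimators=estimators, voting=voting)
    clf.fit(X_tr, y_tr)
    voting_scores[voting] = clf.score(X_te, y_te)
    print(voting, round(voting_scores[voting], 3))
```

Which scheme does better depends on the data and on how well calibrated the base models' probabilities are, so neither result should be assumed in advance.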

