# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [None]:
!conda install -c conda-forge scikit-learn


In [65]:
# import libraries
import re
import pandas as pd
import numpy as np
#from sqlalchemy import create_engine
import sqlite3

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
#from sklearn.compose import ColumnTransformer

In [35]:
pd.__version__

'1.3.4'

In [37]:
import sklearn
sklearn.show_versions()


System:
    python: 3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]
executable: C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\python.exe
   machine: Windows-10-10.0.19044-SP0

Python dependencies:
      sklearn: 1.1.1
          pip: 22.3.1
   setuptools: 57.4.0
        numpy: 1.21.2+mkl
        scipy: 1.8.1
       Cython: None
       pandas: 1.3.4
   matplotlib: 3.6.3
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libiomp
       filepath: C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\Lib\site-packages\numpy\DLLs\libiomp5md.dll
        version: None
    num_threads: 8

       user_api: blas
   internal_api: mkl
         prefix: mkl_rt
       filepath: C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\Lib\site-packages\numpy\DLLs\mkl_rt.1.dll
        version: 2021.4-Product
threading_layer: intel
    num_threads: 4

     

In [4]:
import nltk
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\M.Hedia\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\M.Hedia\AppData\Roaming\nltk_data...


True

In [38]:
# load data from database
#engine = create_engine('sqlite:///DisasterResponse.db')
#df = pd.read_sql_table('MessageCategory', engine)

# connect to the database
conn = sqlite3.connect('DisasterResponse.db')

# run a query
df=pd.read_sql('SELECT * FROM MessageCategory', conn)

#df['child_alone'].iloc[0]=1
#X = df[['message', 'genre']]#.values
X = df[['message']]#.values
Y = df.iloc[:,4:]#.values
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
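The loading step can be tried out on a throwaway in-memory database. This is only a sketch: the table name `MessageCategory` comes from the cell above, while the toy rows and the names `df_toy`, `X_toy` and `Y_toy` are invented for illustration. Selecting columns by name, rather than by position as in `df.iloc[:, 4:]`, is one possible way to keep the feature/target split stable if the column order changes.

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database that mimics the MessageCategory layout.
# The rows are made up for illustration; they are not taken from the real data.
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["need water", "storm is over"],
    "original": ["besoin d'eau", "siklon fini"],
    "genre": ["direct", "news"],
    "related": [1, 1],
    "request": [1, 0],
})
toy.to_sql("MessageCategory", conn, index=False)

# Read the table back, as the notebook does with pd.read_sql.
df_toy = pd.read_sql("SELECT * FROM MessageCategory", conn)
conn.close()

# Split into features and targets by column name instead of position.
non_target_cols = ["id", "message", "original", "genre"]
X_toy = df_toy[["message", "genre"]]
Y_toy = df_toy.drop(columns=non_target_cols)
```

With this split, `X_toy` keeps both `message` and `genre`, so a later step that looks up `genre` would find it.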


In [39]:
Y.shape

(26028, 36)

In [40]:
X.shape

(26028, 1)

In [41]:
X['message'].shape

(26028,)

In [42]:
X['genre'].shape

KeyError: 'genre'

In [43]:
single_value_targets=[]
target_cols=list(df.columns[4:])
for col in target_cols:#df.iloc[:,4:].columns:
    if len(df[col].unique())==1:
        print(col, df[col].unique(), len(df[col].unique()), target_cols.index(col))
        single_value_targets.append(target_cols.index(col))

child_alone [0] 1 9
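The same check can be written without an explicit loop. The sketch below uses a made-up three-column frame (`targets` is a hypothetical stand-in for the 36 target columns) and `DataFrame.nunique` to find columns that hold a single value.

```python
import pandas as pd

# Toy stand-in for the target columns; values are invented for illustration.
targets = pd.DataFrame({
    "related": [1, 0, 1, 1],
    "child_alone": [0, 0, 0, 0],
    "water": [0, 1, 0, 0],
})

# Count distinct values per column and keep the columns with exactly one.
n_classes = targets.nunique()
constant_cols = n_classes[n_classes == 1].index.tolist()

# Positional indices, comparable to single_value_targets in the cell above.
constant_idx = [targets.columns.get_loc(col) for col in constant_cols]
print(constant_cols, constant_idx)
```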


In [44]:
df.child_alone.unique()

array([0], dtype=int64)

In [45]:
df.related.unique()

array([1, 0], dtype=int64)

In [46]:
df.genre.unique()

array(['direct', 'social', 'news'], dtype=object)

### 2. Write a tokenization function to process your text data

In [47]:
# Raw string, so that backslashes such as \( reach the regex engine unchanged
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def tokenize(text):
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
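To see the URL-replacement idea in isolation, here is a simplified stand-in that needs no NLTK data. It is not the notebook's `tokenize`: it uses a shorter URL pattern, splits on alphanumeric runs instead of calling `word_tokenize`, and does no lemmatization. The name `simple_tokenize` and both patterns are assumptions made for this sketch.

```python
import re

# Simplified patterns, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def simple_tokenize(text):
    """Replace URLs with a placeholder, lowercase, and split into tokens."""
    text = URL_PATTERN.sub("urlplaceholder", text)
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_tokenize("Flood map at https://example.org/map?id=3 - please SHARE!")
print(tokens)
```

Replacing the URL before tokenizing keeps it as one token instead of letting the tokenizer break it into fragments.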

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [57]:
rfc_pipeline = Pipeline([('selector', ItemSelector(key='message')),
                         ('vect', CountVectorizer(tokenizer=tokenize)),
                        ('tdfidf', TfidfTransformer()),
                        ('clf', MultiOutputClassifier(RandomForestClassifier()))])

In [58]:
lr_pipeline = Pipeline([('selector', ItemSelector(key='message')),
                        ('vect', CountVectorizer(tokenizer=tokenize)),
                        ('tdfidf', TfidfTransformer()),
                        ('clf', MultiOutputClassifier(LogisticRegression()))])#multi_class='multinomial', solver='lbfgs'
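A stripped-down version of such a pipeline can be run on a handful of made-up messages to see how `MultiOutputClassifier` fits one classifier per target column. The sketch leaves out `ItemSelector` and the custom tokenizer, and the data, the step name `tfidf` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented toy messages with two binary targets: [water, food].
messages = [
    "we need water", "water please", "need food now", "food and rice",
    "water and food", "hello there", "thank you", "good morning",
]
labels = np.array([
    [1, 0], [1, 0], [0, 1], [0, 1],
    [1, 1], [0, 0], [0, 0], [0, 0],
])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # One LogisticRegression is cloned and fitted for each target column.
    ("clf", MultiOutputClassifier(LogisticRegression())),
])
toy_pipeline.fit(messages, labels)

# predict returns one row per message and one column per target.
pred = toy_pipeline.predict(["send water"])
print(pred.shape)
```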

In [26]:
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrices (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

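    def transform(self, data_dict):
        # OneHotEncoder expects 2-D input, so 'genre' is returned as a
        # one-column DataFrame. Text vectorizers expect a 1-D iterable of
        # strings, so other keys are returned as a Series.
        if self.key == 'genre':
            return data_dict[[self.key]]
        else:
            return data_dict[self.key]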


In [27]:
message_transformer = Pipeline([
    ('selector', ItemSelector(key='message')),
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tdfidf', TfidfTransformer())])

In [28]:
genre_transformer = Pipeline([
    ('selector', ItemSelector(key='genre')),
    ('onehot', OneHotEncoder())])

In [29]:
msg_gnre_pipeline=Pipeline([
    ('features', FeatureUnion([
        ('message_pipe', message_transformer),
        ('genre_pipe', genre_transformer)
    ])),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [31]:
X_train.head()

Unnamed: 0,message,genre
15272,The failure of rescuers to reach these areas p...,news
15797,"NEW YORK, 21 September 2007 - The exceptionall...",news
7072,Human rights group Be a pillar on which those ...,direct
3261,Please send us information concerning them.,direct
23655,In deference to the concerns of the Malian gov...,news


In [32]:
msg_gnre_pipeline.fit(X_train, y_train)

In [34]:
target_names=list(df.columns[4:])

y_pred_msg_gnre_rfc = msg_gnre_pipeline.predict(X_test)

report_msg_gnre_rfc = classification_report(y_test, y_pred_msg_gnre_rfc, target_names=target_names)
print(report_msg_gnre_rfc)

                        precision    recall  f1-score   support

               related       0.81      0.97      0.88      3959
               request       0.90      0.49      0.63       902
                 offer       0.00      0.00      0.00        25
           aid_related       0.79      0.63      0.70      2156
          medical_help       0.67      0.04      0.08       431
      medical_products       0.84      0.08      0.15       264
     search_and_rescue       0.71      0.03      0.06       151
              security       0.50      0.01      0.02       106
              military       0.82      0.05      0.10       175
           child_alone       0.00      0.00      0.00         0
                 water       0.91      0.26      0.40       344
                  food       0.85      0.44      0.58       586
               shelter       0.89      0.23      0.37       487
              clothing       0.69      0.11      0.20        79
                 money       0.88      

  _warn_prf(average, modifier, msg_start, len(result))


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [49]:
y_train.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
15272,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15797,1,0,0,1,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
7072,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3261,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23655,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
# For target columns that contain only class 0, set the first training label to 1.
# Some models, such as LogisticRegression, cannot be fitted on a single class.
# Note that this injects one artificial positive label into each such column.
for i in single_value_targets:
    y_train.iloc[0, i] = 1

In [52]:
y_train['child_alone'].head()

15272    1
15797    0
7072     0
3261     0
23655    0
Name: child_alone, dtype: int64

In [59]:
rfc_pipeline.fit(X_train, y_train)

In [60]:
lr_pipeline.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
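The per-column iteration might look like the following sketch. The arrays and the names `y_true`, `y_hat` and `reports` are invented for illustration. Passing `zero_division=0` sets undefined precision or recall to 0 without emitting the warning that otherwise appears when a class is never predicted.

```python
import numpy as np
from sklearn.metrics import classification_report

# Invented ground truth and predictions for two target columns.
target_names_toy = ["related", "request"]
y_true = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
y_hat = np.array([[1, 0], [1, 0], [0, 0], [0, 0]])

reports = {}
for i, name in enumerate(target_names_toy):
    # One report per column: each column is an ordinary binary problem.
    reports[name] = classification_report(
        y_true[:, i], y_hat[:, i], zero_division=0, output_dict=True
    )
    print(name)
    print(classification_report(y_true[:, i], y_hat[:, i], zero_division=0))
```

Unlike a single call on the full multilabel matrix, this reports both the positive and the negative class for every category.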

In [61]:
target_names=list(df.columns[4:])

In [62]:
# predict on test data
y_pred_rfc = rfc_pipeline.predict(X_test)

report_rfc = classification_report(y_test, y_pred_rfc, target_names=target_names)
print(report_rfc)
#print(classification_report(np.hstack(y_test), np.hstack(y_pred)))

                        precision    recall  f1-score   support

               related       0.81      0.98      0.88      3959
               request       0.89      0.45      0.60       902
                 offer       0.00      0.00      0.00        25
           aid_related       0.79      0.63      0.70      2156
          medical_help       0.68      0.04      0.07       431
      medical_products       0.81      0.05      0.09       264
     search_and_rescue       0.60      0.02      0.04       151
              security       0.50      0.01      0.02       106
              military       0.83      0.03      0.06       175
           child_alone       0.00      0.00      0.00         0
                 water       0.91      0.20      0.32       344
                  food       0.84      0.44      0.58       586
               shelter       0.88      0.26      0.40       487
              clothing       0.64      0.09      0.16        79
                 money       1.00      

  _warn_prf(average, modifier, msg_start, len(result))


In [63]:
# predict on test data
y_pred_lr = lr_pipeline.predict(X_test)
report_lr = classification_report(y_test, y_pred_lr, target_names=target_names)
print(report_lr)
#print(classification_report(np.hstack(y_test), np.hstack(y_pred)))

                        precision    recall  f1-score   support

               related       0.84      0.96      0.89      3959
               request       0.82      0.59      0.69       902
                 offer       0.00      0.00      0.00        25
           aid_related       0.76      0.68      0.72      2156
          medical_help       0.67      0.18      0.28       431
      medical_products       0.79      0.20      0.33       264
     search_and_rescue       0.92      0.08      0.15       151
              security       0.00      0.00      0.00       106
              military       0.73      0.11      0.19       175
           child_alone       0.00      0.00      0.00         0
                 water       0.78      0.51      0.62       344
                  food       0.81      0.62      0.71       586
               shelter       0.84      0.45      0.58       487
              clothing       0.76      0.20      0.32        79
                 money       0.81      

  _warn_prf(average, modifier, msg_start, len(result))


In [15]:
(pd.DataFrame(y_pred_lr, columns=target_names)).related.unique()

array([1, 0])

In [16]:
(pd.DataFrame(y_test, columns=target_names)).related.unique()

array([0, 1])

In [17]:
pd.DataFrame(y_pred_lr, columns=target_names).sum(axis=0)

related                   4532
request                    654
offer                        0
aid_related               1923
medical_help               115
medical_products            68
search_and_rescue           13
security                     0
military                    26
child_alone                  0
water                      225
food                       450
shelter                    260
clothing                    21
money                       16
missing_people               1
refugees                    15
death                       62
other_aid                  159
infrastructure_related      23
transport                   33
buildings                   82
electricity                 18
tools                        0
hospitals                    1
shops                        0
aid_centers                  0
other_infrastructure        10
weather_related           1139
floods                     185
storm                      309
fire                         4
earthqua

In [18]:
pd.DataFrame(y_test, columns=target_names).sum(axis=0)

related                   3959
request                    902
offer                       25
aid_related               2156
medical_help               431
medical_products           264
search_and_rescue          151
security                   106
military                   175
child_alone                  0
water                      344
food                       586
shelter                    487
clothing                    79
money                      131
missing_people              55
refugees                   160
death                      225
other_aid                  667
infrastructure_related     317
transport                  228
buildings                  291
electricity                101
tools                       30
hospitals                   49
shops                       23
aid_centers                 56
other_infrastructure       216
weather_related           1467
floods                     414
storm                      516
fire                        55
earthqua

In [19]:
(pd.DataFrame(y_pred_lr, columns=target_names)['aid_related']==pd.DataFrame(y_test, columns=target_names)['aid_related']).sum()

4061
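The match count above can be generalized to a per-label accuracy. The sketch assumes that `y_test` is a DataFrame that still carries the shuffled index from `train_test_split`, while `predict` returns a plain array. Comparing against `to_numpy()` keeps the comparison positional, so differing index labels do not get in the way. All data and names here are made up.

```python
import numpy as np
import pandas as pd

names = ["related", "aid_related"]

# Test labels with a shuffled, non-contiguous index (as after a random split).
y_test_toy = pd.DataFrame(
    [[1, 0], [1, 1], [0, 0]], columns=names, index=[15272, 7072, 3261]
)
# Predictions as returned by a fitted estimator: a plain array, row-aligned.
y_pred_toy = np.array([[1, 0], [1, 0], [0, 0]])

# Elementwise match by position, then the share of matches per column.
matches = y_pred_toy == y_test_toy.to_numpy()
per_label_accuracy = pd.Series(matches.mean(axis=0), index=names)
print(per_label_accuracy)
```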

In [20]:
TfidfTransformer().fit_transform(CountVectorizer().fit_transform(X_train)).toarray()

array([[ 0.        ,  0.11996291,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

### 6. Improve your model
Use grid search to find better parameters. 

In [21]:
lr_pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f2bcbf1ad90>, vocabulary=None)),
  ('tdfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
             intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
             penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
             verbose=0, warm_start=False),
              n_jobs=1))],
 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dty

In [36]:
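class DataFrameMessageExtracter(TransformerMixin):
    """Experimental transformer meant to pull a single column out of X."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        print(X.shape)  # debug output
        # Caution: .iloc[self.column] indexes by row position. On a 1-D Series
        # it returns a single element (one string), not a column of documents.
        return X.iloc[self.column]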

In [20]:
preprocessor = ColumnTransformer([('message', message_transformer, [0]),
                                 ('genre', genre_transformer, [1])])

NameError: name 'ColumnTransformer' is not defined
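The `NameError` is consistent with the `ColumnTransformer` import being commented out in the first cell; the class lives in `sklearn.compose`. The sketch below shows one way it could combine the two columns. It uses `TfidfVectorizer` in place of the `CountVectorizer` plus `TfidfTransformer` pair and addresses columns by name rather than by position; the toy frame and all names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Invented toy frame with the two feature columns.
X_toy = pd.DataFrame({
    "message": ["need water", "storm damage", "thank you"],
    "genre": ["direct", "news", "social"],
})

preprocessor = ColumnTransformer([
    # A scalar column name passes a 1-D Series, which text vectorizers need.
    ("message", TfidfVectorizer(), "message"),
    # A list of names passes a 2-D frame, which OneHotEncoder needs.
    ("genre", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
])

features = preprocessor.fit_transform(X_toy)
print(features.shape)  # 6 vocabulary terms + 3 genre categories
```

The scalar-versus-list distinction does the job that `ItemSelector` handles by special-casing `genre`.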

In [37]:
test_pipeline = Pipeline([('extractcolumn', DataFrameMessageExtracter(0)),
                        ('vect', CountVectorizer(tokenizer=tokenize)),
                        ('tdfidf', TfidfTransformer()),
                        ('clf', MultiOutputClassifier(RandomForestClassifier()))])

In [38]:
X

0        Weather update - a cold front from Cuba that c...
1                  Is the Hurricane over or is it not over
2                          Looking for someone but no name
3        UN reports Leogane 80-90 destroyed. Only Hospi...
4        says: west side of Haiti, rest of the country ...
5                   Information about the National Palace-
6                           Storm at sacred heart of jesus
7        Please, we need tents and water. We are in Sil...
8          I would like to receive the messages, thank you
9        I am in Croix-des-Bouquets. We have health iss...
10       There's nothing to eat and water, we starving ...
11       I am in Petionville. I need more information r...
12       I am in Thomassin number 32, in the area named...
13       Let's do it together, need food in Delma 75, i...
14       More information on the 4636 number in order f...
15       A Comitee in Delmas 19, Rue ( street ) Janvier...
16       We need food and water in Klecin 12. We are dy.

In [39]:
test_pipeline.fit(X_train, y_train)

(20822,)


ValueError: Iterable over raw text documents expected, string object received.
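The printed shape `(20822,)` suggests that `X_train` was a 1-D Series in this run, in which case `X.iloc[0]` returns the first message as a single string, which would explain the error. A column extractor that tolerates both a Series and a DataFrame could look like the sketch below; `ColumnExtractor` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Return one named column as a 1-D Series (hypothetical helper)."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # A Series is already a 1-D collection of documents: pass it through.
        if isinstance(X, pd.Series):
            return X
        # For a DataFrame, select the column by name.
        return X[self.column]


frame = pd.DataFrame({
    "message": ["need water", "storm is over"],
    "genre": ["direct", "news"],
})
docs = ColumnExtractor("message").transform(frame)

# For comparison: positional .iloc on a Series yields one element only.
first_row = frame["message"].iloc[0]
```

Deriving from `BaseEstimator` also provides `get_params` and `set_params`, which grid search relies on when cloning pipeline steps.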

In [70]:
parameters = {
    'features__message_pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'clf__estimator__n_estimators': [50, 100, 200],
    'clf__estimator__min_samples_split': [2, 3, 4]
}

# create grid search object
cv = GridSearchCV(msg_gnre_pipeline, parameters, verbose=4)
cv.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   5.2s
[CV 2/5] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   5.7s
[CV 3/5] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   5.4s
[CV 4/5] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   5.3s
[CV 5/5] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   5.2s
[CV 1/5] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, features__message_pipe__vect__ngram_range

[CV 1/5] END clf__estimator__min_samples_split=3, clf__estimator__n_estimators=200, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   6.8s
[CV 2/5] END clf__estimator__min_samples_split=3, clf__estimator__n_estimators=200, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   7.2s
[CV 3/5] END clf__estimator__min_samples_split=3, clf__estimator__n_estimators=200, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   6.1s
[CV 4/5] END clf__estimator__min_samples_split=3, clf__estimator__n_estimators=200, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   5.9s
[CV 5/5] END clf__estimator__min_samples_split=3, clf__estimator__n_estimators=200, features__message_pipe__vect__ngram_range=(1, 1);, score=nan total time=   7.1s
[CV 1/5] END clf__estimator__min_samples_split=3, clf__estimator__n_estimators=200, features__message_pipe__vect__ngram_range=(1, 2);, score=nan total time=   8.4s
[CV 2/5] END clf

ValueError: 
All the 90 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 870, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 1154, in fit_transform
    results = self._parallel_func(X, y, fit_params, _fit_transform_one)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 1176, in _parallel_func
    return Parallel(n_jobs=self.n_jobs)(
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py", line 1046, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\fixes.py", line 117, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 870, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 414, in fit_transform
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\pipeline.py", line 870, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 870, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "C:\Users\MED01~1.HED\AppData\Local\Temp/ipykernel_13568/3376786925.py", line 38, in transform
    return data_dict[[self.key]]
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 3464, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexing.py", line 1314, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis)
  File "C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexing.py", line 1374, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['genre'], dtype='object')] are in the [columns]"
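The traceback ends in a `KeyError` for `genre`, which fits the fact that `X` was last defined with only the `message` column. A grid search over both columns can be sketched on toy data as follows. This is not the notebook's pipeline: it swaps in `ColumnTransformer` and `TfidfVectorizer`, uses tiny forests, and every value is invented. Setting `error_score="raise"` makes the first failing fit stop the search immediately.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy data: both feature columns are present.
X_toy = pd.DataFrame({
    "message": ["need water", "water please", "need food", "food and rice",
                "storm coming", "heavy storm", "thank you", "good morning"],
    "genre": ["direct", "news"] * 4,
})
Y_toy = pd.DataFrame({
    "water": [1, 1, 0, 0, 0, 0, 0, 0],
    "storm": [0, 0, 0, 0, 1, 1, 0, 0],
})

toy_pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("message", TfidfVectorizer(), "message"),
        ("genre", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
    ])),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Check the input before searching, so a missing column fails early.
missing = {"message", "genre"} - set(X_toy.columns)
assert not missing, f"missing feature columns: {missing}"

param_grid = {
    "features__message__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [5, 10],
}
search = GridSearchCV(toy_pipeline, param_grid, cv=2, error_score="raise")
search.fit(X_toy, Y_toy)
print(search.best_params_)
```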


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
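As one possible extra feature, the sketch below adds the word count of each message next to the TF-IDF columns with `FeatureUnion`. The `MessageLength` transformer is a hypothetical example and its usefulness for this dataset is untested.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLength(BaseEstimator, TransformerMixin):
    """Emit the number of whitespace-separated words per document."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return a 2-D column so it can be stacked next to the TF-IDF matrix.
        return np.array([[len(doc.split())] for doc in X], dtype=float)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

docs = ["need water now", "storm"]
matrix = features.fit_transform(docs)
print(matrix.shape)  # 4 vocabulary terms + 1 length column
```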

### 9. Export your model as a pickle file
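A minimal sketch of the export step, using a small stand-in model and a temporary directory. The file name `classifier.pkl`, the toy data and the model itself are assumptions. One point to keep in mind: `pickle` stores functions by reference, so a pipeline that uses a custom `tokenize` function needs that function to be importable wherever the file is loaded.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in model fitted on invented data.
samples = ["need water", "water please", "thank you", "good morning"]
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(samples, [1, 1, 0, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")  # hypothetical file name

    # Export: write the fitted pipeline in binary mode.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

    # Reload to confirm the round trip works.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print((restored.predict(samples) == model.predict(samples)).all())
```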

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.