# Overview

The goal of this project is to perform a sentiment analysis of Apple customers and uncover actionable insights that could be used to optimize a marketing strategy. To achieve this, we built a predictive model using Natural Language Processing (NLP) that rates the sentiment of a tweet based on its content. At the end of our analysis, we present the findings of our model and provide concrete recommendations for how Apple could improve its marketing strategy and ultimately increase customer satisfaction.

# Business Understanding

Developing an excellent marketing strategy is crucial for an organization to consistently achieve positive results. To perform effective marketing, companies need to gain a deep understanding of their customers and uncover what matters to them most. The challenge is figuring out how to gain this insight in an efficient manner, and how to consistently implement meaningful change. Fortunately, machine learning provides us with unique and effective tools to perform customer sentiment analysis and guide long-term decision making.

# Data Understanding


For this analysis, I used ~136K tweets from 605 Twitter accounts, pulled from the Twitter API. I manually selected these accounts to represent each account class that I am trying to predict.

## Constants

Model constants

In [1]:
model1_max_features = 1250
model1_test_size = .4

model2_max_features = 1250
model2_test_size = .4

model1_min_df = 3
model2_min_df = 3

## Imports / settings

In [2]:
# General imports
import string
import pickle

# Twitter import
import tweepy

# Analysis imports
import pandas as pd
import numpy as np

# NLP imports
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer

# SKlearn imports
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.dummy import DummyClassifier

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas settings
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 90

# Downloads (for NLP)
import nltk
nltk.download('wordnet')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger');

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Functions

These are helper functions that assist in the manipulation of tweet strings for pre-processing purposes.

In [3]:
def strip_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon+1:].lower()
    else:
        return text.lower()

def get_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return (user[at+1:]).lower()
    else:
        return ""

def addHashTags(text):
    return "#" + text + "#"

def get_wordnet_pos(treebank_tag):
    '''
    Translate an nltk (Penn Treebank) POS tag to a WordNet POS tag.
    Tags that are not adjective, verb, noun, or adverb default to noun.
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def remove_characters(text, char_to_remove):
    # Keep only the characters that are not in char_to_remove
    return ''.join(x for x in text if x not in char_to_remove)

def remove_punctuation(text):
    text = remove_characters(text, string.punctuation)
    return text

def tag_and_lemmatize(text):
    newText = text
    newText = pos_tag(newText)
    newText = [(x[0], get_wordnet_pos(x[1])) for x in newText]
    lemma = nltk.stem.WordNetLemmatizer()
    newText = [(lemma.lemmatize(x[0], x[1])) for x in newText]
    return newText

def dummy_fun(doc):
    return doc

# perform all pre-processing on a df
def preprocessing(df):
    preprocessing_01_model_specific(df)
    preprocessing_02_general(df)
    preprocessing_03_tag_and_lemmatize(df)
    
    
def preprocessing_01_model_specific(df):
    # Copy the RT user name from the text column and put it into a different column.
    df['RT_user'] = df['text'].apply(get_rt_user)
    df['RT_user'] = df['RT_user'].apply(lambda x: addHashTags(x) if x != "" else "")

    # Pull out the RT user name from the text column
    df['text'] = df['text'].apply(strip_rt_user)
    
def preprocessing_02_general(df):
    # Lower case the text tweets
    df['text'] = df['text'].str.lower()

    # Strip out the meaningless links
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if n[0:4] != "http"]))

    # Strip any excess white space
    df['text'] = df['text'].apply(lambda x: x.strip())
    
    # Take out stop words
    sw = set(stopwords.words('english'))
    sw.update(['amp'])
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if n not in sw]))

    # Remove punctuation
    df['text'] = df['text'].apply(lambda x: remove_punctuation(x))

    # Drop tokens that consist only of digits
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if not n.isnumeric()]))

    # Put together the RT user and the tweet text
    df['text'] = df['text'] + " " + df['RT_user']

    # Make a new column, tokenize the words
    df['text_tokenized'] = df['text'].str.split()
    
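    # NOTE: drop() and dropna() return new DataFrames, so from here on `df` is a
    # local copy. The column drop, empty-row removal, and label encoding below are
    # not visible to the caller unless the function returns `df` and the caller
    # reassigns it (the outputs later in this notebook still show these columns).
    df = df.drop(columns=['id', 'author_id', 'created_at'])
    
    df['text'] = df['text'].apply(lambda x: np.nan if len(x.strip()) == 0 else x)
    df = df.dropna().reset_index(drop=True) 

    le = LabelEncoder()
    df['class_label'] = le.fit_transform(df['class'])
    df.head()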
    
def preprocessing_03_tag_and_lemmatize(df):
    df['text_tokenized'] = df['text_tokenized'].apply(tag_and_lemmatize)
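
To make the retweet handling concrete, here is a small sketch that runs the three retweet helpers on one sample string. The helper bodies are copied from the cell above so the example runs on its own, and the sample text is adapted from a retweet in the dataset with the link removed. It shows where tokens such as `#jamesgunn#` in the later outputs come from.

```python
def get_rt_user(text):
    # Copy of the notebook helper: lower-cased handle of the retweeted account
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return user[at + 1:].lower()
    return ""


def strip_rt_user(text):
    # Copy of the notebook helper: drop the "RT @user:" prefix and lower-case
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon + 1:].lower()
    return text.lower()


def addHashTags(text):
    # Copy of the notebook helper: wrap the handle so it becomes a distinct token
    return "#" + text + "#"


sample = "RT @VP: President Biden and I are just getting started."
rt_user = get_rt_user(sample)                   # handle of the retweeted account
rt_token = addHashTags(rt_user)                 # token later appended to the text
body = strip_rt_user(sample)                    # tweet body without the RT prefix
plain = get_rt_user("Just a regular tweet")     # non-retweets yield an empty string

print(rt_user, rt_token, repr(body), repr(plain))
```

The leading space left in `body` is harmless here because the general pre-processing step strips and re-splits the text afterwards.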

## Data Collection

Data collection methods and code are located in a separate notebook ([here](notebook_02_data_collection.ipynb)).

# EDA

Exploratory data analysis methods and code are located in a separate notebook ([here](notebook_03_eda.ipynb)).

# Modeling

## Model 1 - Predict primary interest of user

### Load tweet data

Load the tweet data from file. Model 1 uses tweet_list2.csv, which is a scaled-down version of the full tweet data.

In [5]:
# Load tweets from file
tweet_list_file = 'tweet_list2.csv'
df = pd.read_csv(tweet_list_file)

# Format all series as strings
for n in df.columns:
    df[n] = df[n].astype(str)

# Check out the data
df.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at
0,TeamPelosi,Politics - Liberal,1.62e+18,"On this day 83 years ago, Democrats Delivered the first Social Security checks ever! ...",2461810448.0,2023-01-31 21:00:26+00:00
1,TeamPelosi,Politics - Liberal,1.62e+18,We must keep our children safe from gun violence. Safe storage of guns saves lives and...,2461810448.0,2023-01-30 18:45:49+00:00
2,TeamPelosi,Politics - Liberal,1.62e+18,Democrats believe that health care is a human right and #DemocratsDelivered help for ...,2461810448.0,2023-01-28 21:20:12+00:00
3,TeamPelosi,Politics - Liberal,1.62e+18,Congratulations @PADems for your hard-won victories electing Pennsylvania Democrats wh...,2461810448.0,2023-01-28 04:00:31+00:00
4,TeamPelosi,Politics - Liberal,1.62e+18,My heart goes out to Tyre Nichols mother and their entire family. Tyre should be alive...,2461810448.0,2023-01-28 02:15:32+00:00
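
The `id` values above display as `1.62e+18`, which suggests they were stored or parsed as floats at some point, so the exact tweet IDs may no longer be recoverable from this column. As a hedged sketch of one way to avoid that when the CSV holds the full integer IDs, `pd.read_csv` can read every column as a string directly with `dtype=str`. The CSV content below is made up for illustration and is not taken from `tweet_list2.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content with full-length tweet IDs and one empty text field
csv_text = (
    "user_name,id,text\n"
    "example_user,1620000000000000001,hello world\n"
    "example_user,1620000000000000002,\n"
)

# dtype=str keeps the IDs as exact strings instead of converting them to numbers
df_demo = pd.read_csv(io.StringIO(csv_text), dtype=str)

ids = df_demo["id"].tolist()
missing_text = int(df_demo["text"].isna().sum())
print(ids, missing_text)
```

Unlike casting with `astype(str)` after loading, this approach leaves empty fields as real missing values, so a later null check still finds them.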


### Data cleaning

**Check for nulls**

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101074 entries, 0 to 101073
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user_name   101074 non-null  object
 1   class       101074 non-null  object
 2   id          101074 non-null  object
 3   text        101074 non-null  object
 4   author_id   101074 non-null  object
 5   created_at  101074 non-null  object
dtypes: object(6)
memory usage: 4.6+ MB


Notes:
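- No null values are reported, which makes sense because I downloaded this data myself. Keep in mind that every column was cast to `str` above, so any missing value would now appear as the string `'nan'` rather than as a null.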

**Check for duplicates**

In [7]:
df.duplicated().sum()

804

Notes:
- I have some duplicate tweets.  As I noted in the data collection notebook, I must have downloaded some tweets from the same account multiple times while performing the download function. 

**Drop duplicates**

In [8]:
df = df.drop_duplicates()
df.duplicated().sum()

0

Notes:
- Duplicates have been deleted.

### Data review

Check class balance at the tweet level

In [9]:
df['class'].value_counts()

Business and finance       21614
Science / Technology       15548
Politics - Conservative    15500
TV / movies                12007
Politics - Liberal         12001
Sports                     12000
Music                      11600
Name: class, dtype: int64

Notes: 
- It's imbalanced but I'm going to leave it and see if we can still make predictions from the data we have

### Pre-processing 

**Warning** This code performs all pre-processing, including lemmatization of the tweet text.  As such, it takes a few minutes to run.  

In [10]:
# Make a copy of the df, leave the original untouched
df_pp = df.copy()
preprocessing(df_pp)
df_model = df_pp.copy()
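
One caveat when calling `preprocessing(df_pp)` purely for its side effects: column assignments such as `df['text'] = ...` change the caller's DataFrame, but statements of the form `df = df.drop(...)` only rebind the local name inside the function. The following minimal sketch illustrates the difference with a hypothetical function and toy data; it is not the notebook's pre-processing code.

```python
import pandas as pd


def lower_then_drop(df):
    # Column assignment modifies the object the caller passed in
    df["text"] = df["text"].str.lower()
    # Rebinding: `df` now names a new DataFrame, the caller's object is unchanged
    df = df.drop(columns=["id"])
    return df


original = pd.DataFrame({"id": ["1"], "text": ["Hello"]})
returned = lower_then_drop(original)

print(list(original.columns))   # the caller's frame still has the id column
print(list(returned.columns))   # only the returned frame has it removed
```

This matches the null check that follows, where `id`, `author_id`, and `created_at` are still listed. Returning the frame and reassigning it at the call site would be one way to make those later steps take effect.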

Make sure there are no nulls after processing

In [11]:
df_model.isna().sum()

user_name         0
class             0
id                0
text              0
author_id         0
created_at        0
RT_user           0
text_tokenized    0
dtype: int64

First, let's try to predict the primary interest of the user between our main classifications:
- Politics
- Sports and entertainment
- Business and finance
- Science / Technology

In [12]:
df_model.loc[(df_pp['class'] == 'Politics - Conservative') | (df_pp['class'] == 'Politics - Liberal'), 'class'] = 'Politics'
df_model.loc[(df_pp['class'] == 'Music') | (df_pp['class'] == 'TV / movies') | (df_pp['class'] == 'Sports'), 'class'] = 'Sports / Entertainment'
df_model.loc[(df_pp['class'] == 'Business and finance'), 'class'] = 'Business'
df_model = df_model.loc[(df_pp['class'] != 'Travel')]

df_model['class'].value_counts()

Sports / Entertainment    35607
Politics                  27501
Business                  21614
Science / Technology      15548
Name: class, dtype: int64

Aggregate all text words by account

In [13]:
df_model = df_model.groupby(['user_name', 'class']).agg({'text_tokenized': 'sum'}).reset_index()
df_model

Unnamed: 0,user_name,class,text_tokenized
0,20thcentury,Sports / Entertainment,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy..."
1,9to5mac,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf..."
2,ABCNetwork,Sports / Entertainment,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,..."
3,AOC,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ..."
4,Acyn,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo..."
...,...,...,...
170,techreview,Science / Technology,"[sign, download, daily, dose, whats, emerge, technology, subscribe, spark, newsletter,..."
171,tedcruz,Politics,"[mayorkas, impeach, month, ago, thank, sentedcruz, put, extra, emphasis, issue, #fairi..."
172,thedailybeast,Politics,"[favorite, part, jimmykimmel, ask, first, guest, pamela, anderson, ever, meet, mypillo..."
173,wbpictures,Sports / Entertainment,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu..."
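
The `'sum'` aggregation works here because adding Python lists concatenates them, so each account ends up with one long token list. As a sketch on toy data (the account names and tokens are made up), the same result can be produced with `itertools.chain`, which may scale better than repeated list addition when an account has many tweets.

```python
from itertools import chain

import pandas as pd

# Toy stand-in for the tweet-level frame: one token list per tweet
tweets = pd.DataFrame({
    "user_name": ["acct_a", "acct_a", "acct_b"],
    "class": ["Politics", "Politics", "Business"],
    "text_tokenized": [["vote", "bill"], ["senate"], ["market", "stock"]],
})

# Flatten all token lists that belong to the same account into a single list
by_account = (
    tweets.groupby(["user_name", "class"])["text_tokenized"]
    .agg(lambda parts: list(chain.from_iterable(parts)))
    .reset_index()
)

print(by_account)
```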


In [14]:
df_model['class'].value_counts()

Sports / Entertainment    74
Politics                  56
Business                  26
Science / Technology      19
Name: class, dtype: int64

In [15]:
df_model['count_words'] = df_model['text_tokenized'].apply(len)
df_model

Unnamed: 0,user_name,class,text_tokenized,count_words
0,20thcentury,Sports / Entertainment,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy...",6348
1,9to5mac,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf...",8886
2,ABCNetwork,Sports / Entertainment,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,...",5739
3,AOC,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ...",7371
4,Acyn,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo...",5426
...,...,...,...,...
170,techreview,Science / Technology,"[sign, download, daily, dose, whats, emerge, technology, subscribe, spark, newsletter,...",11933
171,tedcruz,Politics,"[mayorkas, impeach, month, ago, thank, sentedcruz, put, extra, emphasis, issue, #fairi...",5211
172,thedailybeast,Politics,"[favorite, part, jimmykimmel, ask, first, guest, pamela, anderson, ever, meet, mypillo...",7388
173,wbpictures,Sports / Entertainment,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu...",6226


In [16]:
df_model['class'].value_counts(normalize=True)

Sports / Entertainment    0.422857
Politics                  0.320000
Business                  0.148571
Science / Technology      0.108571
Name: class, dtype: float64

Aggregate word count by class

In [17]:
df_model_by_class = df_model.groupby(['class']).agg({'count_words': 'sum'}).reset_index()
df_model_by_class

Unnamed: 0,class,count_words
0,Business,279610
1,Politics,370771
2,Science / Technology,223767
3,Sports / Entertainment,340745


### Train-test-split

In [18]:
X = df_model['text_tokenized']
y = df_model['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=model1_test_size)
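
With only 175 accounts and an uneven class mix, a purely random split can leave the smaller classes thinly represented in one of the two sets, and the split changes on every run. The sketch below shows one possible variation, assuming a fixed `random_state` and `stratify` are acceptable here; the seed value 42 and the placeholder features are assumptions, while the label counts come from the account-level `value_counts()` above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Account-level label counts reported above: 74, 56, 26, 19
labels = (
    ["Sports / Entertainment"] * 74
    + ["Politics"] * 56
    + ["Business"] * 26
    + ["Science / Technology"] * 19
)
features = list(range(len(labels)))  # placeholder features for illustration

# stratify keeps class proportions similar in both splits;
# random_state makes the split repeatable (42 is an arbitrary choice)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.4, stratify=labels, random_state=42
)

test_counts = Counter(y_te)
print(len(y_tr), len(y_te), test_counts)
```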

### Dummy Classifier

Use Dummy Classifier to predict most frequent label

In [19]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)
dummy_clf.predict(X)[0]

'Sports / Entertainment'

Get accuracy of the dummy classifier

In [20]:
dummy_clf.score(X, y)

0.4228571428571429
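
The dummy score is simply the share of the most frequent class: 74 of the 175 accounts are labeled Sports / Entertainment, and 74 / 175 ≈ 0.4229. A short sketch that reproduces this from the label counts alone, using placeholder features since `DummyClassifier` ignores them:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Account-level label counts reported above
labels = (
    ["Sports / Entertainment"] * 74
    + ["Politics"] * 56
    + ["Business"] * 26
    + ["Science / Technology"] * 19
)
X_placeholder = np.zeros((len(labels), 1))  # features are ignored by the dummy model

clf = DummyClassifier(strategy="most_frequent").fit(X_placeholder, labels)
accuracy = clf.score(X_placeholder, labels)

print(accuracy, 74 / 175)
```

Any real model for this task should clear this value by a comfortable margin to be considered useful.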

### Complement Naive Bayes Classifier

Use TfidfVectorizer to vectorize the tweet text

In [21]:
# Instantiate a TF-IDF vectorizer limited to model1_max_features terms
# (the text is already tokenized, so the tokenizer and preprocessor are pass-throughs)
tfidf = TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, 
                        preprocessor=dummy_fun, token_pattern=None, 
                        ngram_range=(1,3), min_df=model1_min_df, max_features=model1_max_features)

# Fit the vectorizer on X_train only, then transform both X_train and X_test
X_train_vectorized = tfidf.fit_transform(X_train)
X_test_vectorized = tfidf.transform(X_test)
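
Because the documents are already lists of tokens, the vectorizer is given pass-through functions instead of its own tokenizer. The sketch below shows that setup on three made-up token lists with `ngram_range=(1, 2)`; the function name `identity` and the toy documents are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    # Pass-through: the documents are already token lists
    return tokens


docs = [
    ["tax", "cut", "vote"],
    ["game", "score", "vote"],
    ["tax", "vote", "bill"],
]

vectorizer = TfidfVectorizer(
    analyzer="word",
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,   # unused when a custom tokenizer is supplied
    ngram_range=(1, 2),
)
matrix = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

# N-grams are built by joining adjacent tokens with a space, e.g. "tax cut"
print(sorted(vocab))
print(matrix.shape)
```

In the notebook, `min_df` and `max_features` then prune this vocabulary to the most frequent terms that appear in enough accounts.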

Use a Complement Naive Bayes Classifier

In [22]:
# Import relevant class and function
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score

# Instantiate a ComplementNB classifier
baseline_model = ComplementNB()
baseline_model.fit(X_train_vectorized, y_train)

# Cross-validate the classifier on the training split (5-fold by default in recent scikit-learn versions)
baseline_cv = cross_val_score(baseline_model, X_train_vectorized, y_train)
baseline_cv.mean()

0.9714285714285713

In [23]:
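# Cross-validate on the test split. Note that cross_val_score refits a fresh
# copy of the model on folds of the test data; it does not score the model fit above.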
baseline_cv = cross_val_score(baseline_model, X_test_vectorized, y_test)
baseline_cv.mean()

0.9714285714285715
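
Since `cross_val_score` refits the estimator on the folds it is given, the number above measures how well a model trained on part of the test split predicts the rest of it. A more conventional held-out check would fit once on the training split and score on the untouched test split. The following is a sketch of that pattern with a `Pipeline` on made-up token lists; the data, labels, and the `identity` helper are assumptions, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline


def identity(tokens):
    # Pass-through for pre-tokenized documents
    return tokens


train_docs = [
    ["vote", "senate", "bill"],
    ["senate", "vote", "law"],
    ["game", "score", "team"],
    ["team", "win", "game"],
]
train_labels = ["Politics", "Politics",
                "Sports / Entertainment", "Sports / Entertainment"]
test_docs = [["vote", "bill"], ["game", "win"]]
test_labels = ["Politics", "Sports / Entertainment"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", tokenizer=identity,
                              preprocessor=identity, token_pattern=None)),
    ("clf", ComplementNB()),
])

# Fit on the training split only, then evaluate on data the model has not seen
pipe.fit(train_docs, train_labels)
test_predictions = list(pipe.predict(test_docs))
test_accuracy = pipe.score(test_docs, test_labels)

print(test_predictions, test_accuracy)
```

In the notebook's terms, the equivalent call would be `baseline_model.score(X_test_vectorized, y_test)`, which uses the model already fit on the training data.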

### Save vectorizer and model to file

In [24]:
# Store the fitted vectorizer and model
with open("models/model1_tfidf.pkl", 'wb') as handle:
    pickle.dump(tfidf, handle)

with open("models/model1_model.pkl", 'wb') as handle:
    pickle.dump(baseline_model, handle)
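
A quick way to gain confidence in the saved artifacts is to load them back and confirm the predictions are unchanged. The sketch below does this round trip in memory for a small `ComplementNB` model fit on made-up count data; it does not read the files written above.

```python
import pickle

import numpy as np
from sklearn.naive_bayes import ComplementNB

# Toy count features and labels for illustration only
X_demo = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]])
y_demo = np.array(["Politics", "Politics", "Business", "Business"])

model = ComplementNB().fit(X_demo, y_demo)

# Serialize and restore the fitted model, then compare predictions
restored = pickle.loads(pickle.dumps(model))
same = bool((restored.predict(X_demo) == model.predict(X_demo)).all())

print(same)
```

Two practical points for the real files: `open(..., 'wb')` does not create the `models/` directory, so it has to exist beforehand, and the pickled vectorizer stores a reference to `dummy_fun` by name, so the code that loads it needs a function with that name available in the same module path.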

## Model 2 - Predict political affiliation

### Load tweet data

Load the tweet data from file. Model 2 uses the full tweet file, tweet_list.csv.

In [25]:
# Load tweets from file
tweet_list_file = 'tweet_list.csv'
df = pd.read_csv(tweet_list_file)

# Format all series as strings
for n in df.columns:
    df[n] = df[n].astype(str)

# Check out the data
df.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at
0,BennieGThompson,Politics - Liberal,1.62058e+18,"Today marks the 83rd anniversary of the first ever #SocialSecurity check, and Republic...",82453460.0,2023-02-01 00:45:11+00:00
1,BennieGThompson,Politics - Liberal,1.62012e+18,RT @VP: President Biden and I are just getting started. https://t.co/gLmNbpKGAN,82453460.0,2023-01-30 17:46:29+00:00
2,BennieGThompson,Politics - Liberal,1.62012e+18,"RT @RepJeffries: We will never negotiate away the health, safety or economic well-bein...",82453460.0,2023-01-30 17:46:12+00:00
3,BennieGThompson,Politics - Liberal,1.62012e+18,https://t.co/Ze7ePCUJJ2,82453460.0,2023-01-30 17:45:55+00:00
4,BennieGThompson,Politics - Liberal,1.62006e+18,https://t.co/ley5hNsz0y https://t.co/RFdTeGXGO1,82453460.0,2023-01-30 14:10:33+00:00


### Data cleaning

**Check for nulls**

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136672 entries, 0 to 136671
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user_name   136672 non-null  object
 1   class       136672 non-null  object
 2   id          136672 non-null  object
 3   text        136672 non-null  object
 4   author_id   136672 non-null  object
 5   created_at  136672 non-null  object
dtypes: object(6)
memory usage: 6.3+ MB


Notes:
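- No null values are reported, which makes sense because I downloaded this data myself. Keep in mind that every column was cast to `str` above, so any missing value would now appear as the string `'nan'` rather than as a null.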

**Check for duplicates**

In [27]:
df.duplicated().sum()

878

Notes:
- I have some duplicate tweets.  As I noted in the data collection notebook, I must have downloaded some tweets from the same account multiple times while performing the download function. 

**Drop duplicates**

In [28]:
df = df.drop_duplicates()
df.duplicated().sum()

0

Notes:
- Duplicates have been deleted.

### Data review

Check class balance at the tweet level

In [29]:
df['class'].value_counts()

Politics - Conservative    31032
Politics - Liberal         26998
Business and finance       21614
Science / Technology       15548
TV / movies                12007
Sports                     12000
Music                      11600
Travel                      4995
Name: class, dtype: int64

Notes: 
- It's imbalanced but I'm going to leave it and see if we can still make predictions from the data we have

### Pre-processing 

**Warning** This code performs all pre-processing, including lemmatization of the tweet text.  As such, it takes a few minutes to run.  

In [30]:
# Make a copy of the df, leave the original untouched
df_pp = df.copy()
preprocessing(df_pp)
df_model = df_pp.copy()

Make sure there are no nulls after processing

In [31]:
df_model.isna().sum()

user_name         0
class             0
id                0
text              0
author_id         0
created_at        0
RT_user           0
text_tokenized    0
dtype: int64

Narrow it down to just Politics - Conservative and Politics - Liberal:

In [32]:
df_model = df_model.loc[df_model['class'].isin(['Politics - Conservative', 'Politics - Liberal'])]
df_model['class'].value_counts()

Politics - Conservative    31032
Politics - Liberal         26998
Name: class, dtype: int64

Aggregate all text words by account

In [33]:
df_model = df_model.groupby(['user_name', 'class']).agg({'text_tokenized': 'sum'}).reset_index()
df_model

Unnamed: 0,user_name,class,text_tokenized
0,AOC,Politics - Liberal,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ..."
1,Acyn,Politics - Liberal,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo..."
2,AustinScottGA08,Politics - Conservative,"[today, international, holocaust, remembrance, day, please, join, honor, million, peop..."
3,BarackObama,Politics - Liberal,"[along, mourn, tyre, support, family, u, mobilize, last, change, learn, community, rei..."
4,BennieGThompson,Politics - Liberal,"[today, mark, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cele..."
...,...,...,...
482,staceyabrams,Politics - Liberal,"[across, south, disability, poverty, yoke, georgia, help, eliminate, hbcs, wait, list,..."
483,stkirsch,Politics - Conservative,"[response, joel, smalley, mp, regard, evidence, safety, effectiveness, covid, vaccine,..."
484,tedcruz,Politics - Conservative,"[mayorkas, impeach, month, ago, thank, sentedcruz, put, extra, emphasis, issue, #fairi..."
485,thedailybeast,Politics - Liberal,"[favorite, part, jimmykimmel, ask, first, guest, pamela, anderson, ever, meet, mypillo..."


In [34]:
df_model['class'].value_counts()

Politics - Conservative    251
Politics - Liberal         236
Name: class, dtype: int64

In [35]:
df_model['count_words'] = df_model['text_tokenized'].apply(len)
df_model.head()

Unnamed: 0,user_name,class,text_tokenized,count_words
0,AOC,Politics - Liberal,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ...",7371
1,Acyn,Politics - Liberal,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo...",5426
2,AustinScottGA08,Politics - Conservative,"[today, international, holocaust, remembrance, day, please, join, honor, million, peop...",1262
3,BarackObama,Politics - Liberal,"[along, mourn, tyre, support, family, u, mobilize, last, change, learn, community, rei...",9587
4,BennieGThompson,Politics - Liberal,"[today, mark, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cele...",938


In [36]:
df_model['class'].value_counts(normalize=True)

Politics - Conservative    0.5154
Politics - Liberal         0.4846
Name: class, dtype: float64

Aggregate word count by class

In [37]:
df_model_by_class = df_model.groupby(['class']).agg({'count_words': 'sum'}).reset_index()
df_model_by_class

Unnamed: 0,class,count_words
0,Politics - Conservative,443452
1,Politics - Liberal,465009


### Train-test-split

In [38]:
X = df_model['text_tokenized']
y = df_model['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=model2_test_size)

### Dummy Classifier

Use Dummy Classifier to predict most frequent label

In [39]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)
dummy_clf.predict(X)[0]

'Politics - Conservative'

Get accuracy of the dummy classifier

In [40]:
dummy_clf.score(X, y)

0.5154004106776181

### Complement Naive Bayes Classifier

Use TfidfVectorizer to vectorize the tweet text

In [41]:
# Instantiate a TF-IDF vectorizer limited to model2_max_features terms
# (the text is already tokenized, so the tokenizer and preprocessor are pass-throughs)
tfidf = TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, 
                        preprocessor=dummy_fun, token_pattern=None, 
                        ngram_range=(1,3), min_df=model2_min_df, max_features=model2_max_features)

# Fit the vectorizer on X_train only, then transform both X_train and X_test
X_train_vectorized = tfidf.fit_transform(X_train)
X_test_vectorized = tfidf.transform(X_test)

Use a Complement Naive Bayes Classifier

In [42]:
# Import relevant class and function
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score

# Instantiate a ComplementNB classifier
baseline_model = ComplementNB()
baseline_model.fit(X_train_vectorized, y_train)

# Cross-validate the classifier on the training split (5-fold by default in recent scikit-learn versions)
baseline_cv = cross_val_score(baseline_model, X_train_vectorized, y_train)
baseline_cv.mean()

0.9419053185271771

In [43]:
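# Cross-validate on the test split. Note that cross_val_score refits a fresh
# copy of the model on folds of the test data; it does not score the model fit above.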
baseline_cv = cross_val_score(baseline_model, X_test_vectorized, y_test)
baseline_cv.mean()

0.9589743589743589
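
For a two-class problem like this one, a single accuracy figure hides which side the errors fall on. A confusion matrix separates conservative accounts predicted as liberal from the reverse. The sketch below uses made-up true and predicted labels purely to show the layout; it does not reflect the model's actual predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

CON = "Politics - Conservative"
LIB = "Politics - Liberal"

# Hypothetical labels for illustration only
y_true = [CON, CON, LIB, LIB, CON]
y_pred = [CON, LIB, LIB, LIB, CON]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[CON, LIB])
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, labels=[CON, LIB]))
```

In the notebook, `y_true` would be `y_test` and `y_pred` would come from `baseline_model.predict(X_test_vectorized)`.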

### Save vectorizer and model to file

In [44]:
# Store the fitted vectorizer and model
with open("models/model2_tfidf.pkl", 'wb') as handle:
    pickle.dump(tfidf, handle)

with open("models/model2_model.pkl", 'wb') as handle:
    pickle.dump(baseline_model, handle)

## Deploy model

Deployment methods and code are located in a separate notebook ([here](notebook_04_deployment.ipynb)).

# Final Evaluation

XXXXXXX

# Recommendations

XXXXXXX

# Next Steps

XXXXXXX