# Overview

The goal of this project is to perform a sentiment analysis of Apple customers and uncover actionable insights that could be used to optimize a marketing strategy. To achieve this, we built a predictive model using Natural Language Processing (NLP) that rates the sentiment of a tweet based on its content. At the end of our analysis, we present the findings of our model and provide concrete recommendations for how Apple could improve its marketing strategy and ultimately increase customer satisfaction.

# Business Understanding

Developing an excellent marketing strategy is crucial for an organization to consistently achieve positive results. To perform effective marketing, companies need to gain a deep understanding of their customers and uncover what matters to them most. The challenge is figuring out how to gain this insight in an efficient manner, and how to consistently implement meaningful change. Fortunately, machine learning provides us with unique and effective tools to perform customer sentiment analysis and guide long-term decision making.

# Data Understanding


For this analysis, I used 115,511 tweets from 587 Twitter accounts, pulled from the Twitter API. I manually selected these accounts to represent each account class that I am trying to predict.

## Imports / settings

In [1]:
# General imports
import string

# Analysis imports
import pandas as pd
import numpy as np

# NLP imports
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer

# scikit-learn imports
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.dummy import DummyClassifier

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas settings
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 90

# Downloads (for NLP)
import nltk
nltk.download('wordnet')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger');

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Constants

In [2]:
tweet_list_file = 'tweet_list.csv'

## Functions

These are helper functions that assist in the manipulation of tweet strings for pre-processing purposes.

In [3]:
def strip_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon+1:].lower()
    else:
        return text.lower()

def get_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return (user[at+1:]).lower()
    else:
        return ""

def addHashTags(text):
    return "#" + text + "#"

# Translate nltk POS to wordnet tags
def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def remove_characters(text, char_to_remove):
    # Keep only the characters that are not in char_to_remove
    return ''.join(x for x in text if x not in char_to_remove)

def remove_punctuation(text):
    text = remove_characters(text, string.punctuation)
    return text

def tag_and_lemmatize(text):
    newText = text
    newText = pos_tag(newText)
    newText = [(x[0], get_wordnet_pos(x[1])) for x in newText]
    lemma = nltk.stem.WordNetLemmatizer()
    newText = [(lemma.lemmatize(x[0], x[1])) for x in newText]
    return newText

def dummy_fun(doc):
    return doc

# perform all pre-processing on a df
def preprocessing(df):
    preprocessing_01_model_specific(df)
    preprocessing_02_general(df)
    preprocessing_03_tag_and_lemmatize(df)
    
    
def preprocessing_01_model_specific(df):
    # Copy the RT user name from the text column and put it into a different column.
    df['RT_user'] = df['text'].apply(get_rt_user)
    df['RT_user'] = df['RT_user'].apply(lambda x: addHashTags(x) if x != "" else "")

    # Pull out the RT user name from the text column
    df['text'] = df['text'].apply(strip_rt_user)
    
def preprocessing_02_general(df):
    # Lower case the text tweets
    df['text'] = df['text'].str.lower()

    # Strip out the meaningless links
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if n[0:4] != "http"]))

    # Strip any excess white space
    df['text'] = df['text'].apply(lambda x: x.strip())
    
    # Take out stop words
    sw = set(stopwords.words('english'))
    sw.update(['amp'])
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if n not in sw]))

    # Remove punctuation
    df['text'] = df['text'].apply(lambda x: remove_punctuation(x))

    # Make sure we don't have any random numbers
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if not n.isnumeric()]))

    # Put together the RT user and the tweet text
    df['text'] = df['text'] + " " + df['RT_user']

    # Make a new column, tokenize the words
    df['text_tokenized'] = df['text'].str.split()
    
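    # NOTE: drop() and dropna() return new DataFrames. Because the results are
    # assigned to the local name `df`, the steps below do not change the
    # DataFrame that was passed in by the caller.
    df = df.drop(columns=['id', 'author_id', 'created_at'])
    
    # Mark empty tweets as missing and drop them
    df['text'] = df['text'].apply(lambda x: np.nan if len(x.strip()) == 0 else x)
    df = df.dropna().reset_index(drop=True) 

    # Encode the class names as integer labels
    le = LabelEncoder()
    df['class_label'] = le.fit_transform(df['class'])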
    
def preprocessing_03_tag_and_lemmatize(df):
    df['text_tokenized'] = df['text_tokenized'].apply(tag_and_lemmatize)
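
As a small illustration of how the retweet helpers above behave, the sketch below applies them to one of the retweets shown later in this notebook. The two functions are repeated (lightly condensed) so the example runs on its own; the variable names `tweet`, `rt_user`, `body`, and `marker` are only for illustration.

```python
def strip_rt_user(text):
    # Drop the leading "RT @user:" part of a retweet and lower-case the rest
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon + 1:].lower()
    return text.lower()


def get_rt_user(text):
    # Return the lower-cased retweeted user name, or "" for a regular tweet
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return user[at + 1:].lower()
    return ""


tweet = "RT @VP: President Biden and I are just getting started."

rt_user = get_rt_user(tweet)    # retweeted account name
body = strip_rt_user(tweet)     # tweet text without the "RT @user:" prefix
marker = "#" + rt_user + "#"    # same wrapping that addHashTags applies

print(rt_user)
print(body.strip())
print(marker)
```

Wrapping the retweeted account name in `#` characters keeps it distinct from ordinary words once it is appended to the tweet text, so it can act as its own feature.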

## Data Collection

Data collection methods and code are located in a separate notebook ([here](notebook_02_data_collection.ipynb)).

## Load tweet data

Load the tweet data from file.

In [4]:
# Load tweets from file
df = pd.read_csv(tweet_list_file)

# Format all series as strings
for n in df.columns:
    df[n] = df[n].astype(str)

# Check out the data
df.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at
0,BennieGThompson,Politics - Liberal,1620584010991939584,"Today marks the 83rd anniversary of the first ever #SocialSecurity check, and Republic...",82453460,2023-02-01 00:45:11+00:00
1,BennieGThompson,Politics - Liberal,1620116251749269511,RT @VP: President Biden and I are just getting started. https://t.co/gLmNbpKGAN,82453460,2023-01-30 17:46:29+00:00
2,BennieGThompson,Politics - Liberal,1620116182618759168,"RT @RepJeffries: We will never negotiate away the health, safety or economic well-bein...",82453460,2023-01-30 17:46:12+00:00
3,BennieGThompson,Politics - Liberal,1620116109864357888,https://t.co/Ze7ePCUJJ2,82453460,2023-01-30 17:45:55+00:00
4,BennieGThompson,Politics - Liberal,1620061909113516036,https://t.co/ley5hNsz0y https://t.co/RFdTeGXGO1,82453460,2023-01-30 14:10:33+00:00


## Data cleaning

**Check for nulls**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115511 entries, 0 to 115510
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user_name   115511 non-null  object
 1   class       115511 non-null  object
 2   id          115511 non-null  object
 3   text        115511 non-null  object
 4   author_id   115511 non-null  object
 5   created_at  115511 non-null  object
dtypes: object(6)
memory usage: 5.3+ MB


Notes:
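- There are no null values, which makes sense because I downloaded this data myself. Because every column was cast to `str` in the previous cell, any missing values may already have been converted to the string `'nan'`, so this check is not conclusive on its own.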

**Check for duplicates**

In [6]:
df.duplicated().sum()

877

Notes:
- I have some duplicate tweets. As I noted in the data collection notebook, I must have downloaded tweets from some accounts more than once while running the download function.

**Drop duplicates**

In [7]:
df = df.drop_duplicates()
df.duplicated().sum()

0

Notes:
- Duplicates have been deleted.

## Data review

Check class balance at the tweet level

In [8]:
df['class'].value_counts()

Politics - Conservative    31032
Politics - Liberal         26998
TV / movies                12007
Sports                     12000
Music                      11600
Business and finance        8452
Science / Technology        7550
Travel                      4995
Name: class, dtype: int64

Notes: 
- The classes are imbalanced (the two Politics classes make up about half of the tweets), but I'm going to leave them as they are and see if we can still make predictions from the data we have.
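
To put numbers on the imbalance, the sketch below turns the tweet counts reported above into class shares. The counts are copied from the `value_counts()` output; the variable names are only for illustration.

```python
# Tweet counts per class, copied from the value_counts() output above
class_counts = {
    "Politics - Conservative": 31032,
    "Politics - Liberal": 26998,
    "TV / movies": 12007,
    "Sports": 12000,
    "Music": 11600,
    "Business and finance": 8452,
    "Science / Technology": 7550,
    "Travel": 4995,
}

total_tweets = sum(class_counts.values())

# Share of all tweets that belongs to each class
class_shares = {name: count / total_tweets for name, count in class_counts.items()}

# Combined share of the two Politics classes
politics_share = (class_shares["Politics - Conservative"]
                  + class_shares["Politics - Liberal"])

for name, share in class_shares.items():
    print(f"{name:<26}{share:6.1%}")
print(f"{'Politics (combined)':<26}{politics_share:6.1%}")
```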

# Modeling

## Pre-processing 

**Warning:** This code performs all pre-processing, including lemmatization of the tweet text, so it takes a few minutes to run.

In [9]:
# Make a copy of the df, leave the original untouched
df_pp = df.copy()
preprocessing(df_pp)
df_pp

Unnamed: 0,user_name,class,id,text,author_id,created_at,RT_user,text_tokenized
0,BennieGThompson,Politics - Liberal,1620584010991939584,today marks 83rd anniversary first ever socialsecurity check republicans celebrating t...,82453460,2023-02-01 00:45:11+00:00,,"[today, mark, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cele..."
1,BennieGThompson,Politics - Liberal,1620116251749269511,president biden getting started #vp#,82453460,2023-01-30 17:46:29+00:00,#vp#,"[president, biden, get, start, #vp#]"
2,BennieGThompson,Politics - Liberal,1620116182618759168,never negotiate away health safety economic wellbeing american people #repjeffries#,82453460,2023-01-30 17:46:12+00:00,#repjeffries#,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, #repje..."
3,BennieGThompson,Politics - Liberal,1620116109864357888,,82453460,2023-01-30 17:45:55+00:00,,[]
4,BennieGThompson,Politics - Liberal,1620061909113516036,,82453460,2023-01-30 14:10:33+00:00,,[]
...,...,...,...,...,...,...,...,...
115506,RepLCD,Politics - Conservative,1611786100825006080,great catch friend repfeenstra last night were ready get work amp deliver promises mad...,1583530102297600000,2023-01-07 18:05:26+00:00,,"[great, catch, friend, repfeenstra, last, night, be, ready, get, work, amp, deliver, p..."
115507,RepLCD,Politics - Conservative,1611615029660639233,thank or05 placing trust represent halls congress solemn promise oregonians carry cons...,1583530102297600000,2023-01-07 06:45:40+00:00,,"[thank, or05, place, trust, represent, hall, congress, solemn, promise, oregonian, car..."
115508,RepLCD,Politics - Conservative,1610791524807081986,small minority preventing house work sent do must get economy back track work get cost...,1583530102297600000,2023-01-05 00:13:21+00:00,,"[small, minority, prevent, house, work, send, do, must, get, economy, back, track, wor..."
115509,RepLCD,Politics - Conservative,1610408428052295681,take responsibility serving or05 im grateful family side,1583530102297600000,2023-01-03 22:51:03+00:00,,"[take, responsibility, serve, or05, im, grateful, family, side]"
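
The output above still contains the `id`, `author_id`, and `created_at` columns, as well as rows with empty text, even though `preprocessing_02_general` calls `drop` and `dropna`. A likely explanation is that those calls return new DataFrames that are only bound to the function's local name. The sketch below reproduces that behavior on a hypothetical two-row frame; `drop_locally` and `toy` are illustrative names, not part of the notebook.

```python
import pandas as pd


def drop_locally(df):
    # Assigning a column modifies the object the caller passed in
    df["flag"] = 1
    # drop() returns a new DataFrame; this only rebinds the local name `df`
    df = df.drop(columns=["id"])
    return df


toy = pd.DataFrame({"id": [1, 2], "text": ["a", ""]})
returned = drop_locally(toy)

caller_columns = list(toy.columns)         # the caller still has "id"
returned_columns = list(returned.columns)  # only the returned frame lost it

print(caller_columns)
print(returned_columns)
```

If the dropped columns and rows are meant to disappear from `df_pp`, one option is to have each pre-processing function return its DataFrame and reassign the result at the call site.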


In [10]:
df_pp.user_name.unique()

array(['BennieGThompson', 'BettyMcCollum04', 'BillPascrell', 'BobbyScott',
       'BradSherman', 'Call_Me_Dutch', 'chelliepingree', 'CongBoyle',
       'CongressmanRaja', 'CongresswomanSC', 'DonaldNorcross',
       'DorisMatsui', 'EleanorNorton', 'FrankPallone', 'GerryConnolly',
       'gracenapolitano', 'GuamCongressman', 'Ilhan', 'JacksonLeeTX18',
       'JoaquinCastrotx', 'Kilili_Sablan', 'NormaJTorres',
       'NydiaVelazquez', 'TeamPelosi', 'AOC', 'staceyabrams', 'ewarren',
       'SenWarren', 'JoeBiden', 'KamalaHarris', 'BarackObama',
       'HillaryClinton', 'BillClinton', 'WhiteHouse', 'POTUS', 'MSNBC',
       'HuffPost', 'CNNPolitics', 'TheAtlantic', 'MotherJones',
       'thedailybeast', 'JoyAnnReid', 'DNC', 'Acyn', 'MeidasTouch',
       'briantylercohen', 'mmpadellan', 'MarkRuffalo', 'laurenboebert',
       'mattgaetz', 'tedcruz', 'RandPaul', 'GOP', 'RNCResearch',
       'foxnewspolitics', 'BreitbartNews', 'NEWSMAX', 'TheDCPolitics',
       'OANN', 'realDailyWire', 'JesseBWa

Make sure there are no nulls after pre-processing

In [11]:
df_pp.isna().sum()

user_name         0
class             0
id                0
text              0
author_id         0
created_at        0
RT_user           0
text_tokenized    0
dtype: int64

First, let's try to predict the primary interest of the user among our main classifications:
- Politics
- Sports
- TV / movies
- Business and finance
- Music
- Travel
- Science / Technology

In [12]:
df_pp.loc[(df_pp['class'] == 'Politics - Conservative') | (df_pp['class'] == 'Politics - Liberal'), 'class'] = 'Politics'
df_pp['class'].value_counts()

Politics                58030
TV / movies             12007
Sports                  12000
Music                   11600
Business and finance     8452
Science / Technology     7550
Travel                   4995
Name: class, dtype: int64

Aggregate all text words by account

In [13]:
df_model = df_pp.groupby(['user_name', 'class']).agg({'text_tokenized': 'sum'}).reset_index()
df_model

Unnamed: 0,user_name,class,text_tokenized
0,20thcentury,TV / movies,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy..."
1,9to5mac,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf..."
2,ABCNetwork,TV / movies,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,..."
3,AOC,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ..."
4,Acyn,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo..."
...,...,...,...
581,travelchannel,Travel,"[late, episode, kindredspirits, u, like, miss, it, stream, discoveryplus, amybruni, ad..."
582,travelocity,Travel,"[sometimes, hard, part, travel, start, pack, process, weve, do, hard, part, ya, tag, f..."
583,virginiafoxx,Politics,"[regular, order, restore, people, house, student, reward, hard, work, education, burea..."
584,wbpictures,TV / movies,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu..."
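
The `'sum'` aggregation works here because adding Python lists concatenates them, so each account ends up with one long token list. The sketch below shows the same idea in plain Python; the rows and tokens are hypothetical stand-ins for `df_pp`.

```python
from collections import defaultdict

# Hypothetical (user_name, class, text_tokenized) rows, one per tweet
rows = [
    ("AOC", "Politics", ["excite", "humble"]),
    ("AOC", "Politics", ["serve", "house"]),
    ("travelocity", "Travel", ["pack", "travel"]),
]

# Concatenate the token lists of all tweets that share an account and class
tokens_by_account = defaultdict(list)
for user_name, class_name, tokens in rows:
    tokens_by_account[(user_name, class_name)].extend(tokens)

for key, tokens in tokens_by_account.items():
    print(key, tokens)
```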


In [14]:
df_model['class'].value_counts()

Politics                487
Music                    29
Sports                   24
TV / movies              21
Business and finance     10
Science / Technology      9
Travel                    6
Name: class, dtype: int64

In [15]:
df_model['count_words'] = df_model['text_tokenized'].apply(len)
df_model

Unnamed: 0,user_name,class,text_tokenized,count_words
0,20thcentury,TV / movies,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy...",6348
1,9to5mac,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf...",8886
2,ABCNetwork,TV / movies,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,...",5739
3,AOC,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ...",7371
4,Acyn,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo...",5426
...,...,...,...,...
581,travelchannel,Travel,"[late, episode, kindredspirits, u, like, miss, it, stream, discoveryplus, amybruni, ad...",7620
582,travelocity,Travel,"[sometimes, hard, part, travel, start, pack, process, weve, do, hard, part, ya, tag, f...",10614
583,virginiafoxx,Politics,"[regular, order, restore, people, house, student, reward, hard, work, education, burea...",746
584,wbpictures,TV / movies,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu...",6226


Aggregate word count by class

In [17]:
df_model_by_class = df_model.groupby(['class']).agg({'count_words': 'sum'}).reset_index()
df_model_by_class

| | class | count_words |
|---|---|---|
| 0 | Business and finance | 110475 |
| 1 | Music | 108593 |
| 2 | Politics | 908577 |
| 3 | Science / Technology | 109941 |
| 4 | Sports | 107750 |
| 5 | TV / movies | 124402 |
| 6 | Travel | 64818 |


## Train-test-split

In [62]:
X = df_model['text_tokenized']
y = df_model['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.50, stratify=y)

In [63]:
X_train

373    [repmoolenaar, recently, sponsor, legislation, would, repeal, unobligated, balance, ap...
425    [year, ago, social, security, issue, first, check, housedemocrats, keep, fight, protec...
437    [state, public, utility, commission, begin, investigation, lead, technical, issue, off...
37     [the, family, have, make, rich, friend, choose, make, family, thing, youll, never, fee...
526    [year, ago, today, first, ever, socialsecurity, check, go, out, republican, celebrate,...
                                                 ...                                            
486    [history, make, chair, cathymcmorris, rodgers, first, woman, chair, energy, commerce, ...
563    [america, must, stand, ally, unequivocally, defend, israel, right, defend, existential...
516    [break, true, vote, twitter, account, truethevote, reinstate, break, sidney, powell, t...
228    [excited, get, work, work, family, family, farmer, team, honor, welcome, el, florista,...
78     [theyll, threaten, pros

## Dummy Classifier

Use Dummy Classifier to predict most frequent label

In [64]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)
dummy_clf.predict(X)[0]

'Politics'

Get accuracy of the dummy classifier

In [65]:
dummy_clf.score(X, y)

0.8310580204778157
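
The dummy score is simply the share of accounts in the most frequent class, which can be checked by hand from the account counts reported earlier. A minimal sketch, with the counts copied from the `df_model['class'].value_counts()` output:

```python
# Accounts per class, copied from the value_counts() output above
account_counts = {
    "Politics": 487,
    "Music": 29,
    "Sports": 24,
    "TV / movies": 21,
    "Business and finance": 10,
    "Science / Technology": 9,
    "Travel": 6,
}

total_accounts = sum(account_counts.values())

# A most-frequent-class baseline always predicts the largest class
majority_class = max(account_counts, key=account_counts.get)
baseline_accuracy = account_counts[majority_class] / total_accounts

print(majority_class, total_accounts, baseline_accuracy)
```

Any model trained on this data needs to beat roughly 83% accuracy to be doing better than always guessing `Politics`.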

## Complement Naive Bayes Classifier

Use `TfidfVectorizer` to vectorize the tokenized tweet text

In [66]:
# Instantiate a vectorizer limited to the 700 most frequent features
# (the text is already tokenized, so the tokenizer and preprocessor are pass-throughs)
tfidf = TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, 
                        preprocessor=dummy_fun, token_pattern=None, 
                        ngram_range=(1,3), min_df=2, max_features=700)

# Fit the vectorizer on X_train and transform it
X_train_vectorized = tfidf.fit_transform(X_train)

# Visually inspect the vectorized data
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
pd.DataFrame.sparse.from_spmatrix(X_train_vectorized, columns=tfidf.get_feature_names_out())

Unnamed: 0,#dineshdsouza#,118th,118th congress,able,abortion,access,account,accountable,across,across country,...,write,year,year ago,yes,yesterday,yet,york,you,young,youre
0,0.0,0.000000,0.000000,0.000000,0.000000,0.039983,0.000000,0.000000,0.016451,0.000000,...,0.000000,0.054472,0.000000,0.000000,0.000000,0.000000,0.000000,0.018433,0.022558,0.000000
1,0.0,0.053728,0.055058,0.000000,0.099914,0.111553,0.000000,0.017837,0.052457,0.021000,...,0.000000,0.130269,0.016811,0.000000,0.017346,0.034555,0.000000,0.000000,0.017982,0.000000
2,0.0,0.000000,0.000000,0.038184,0.020702,0.082550,0.000000,0.000000,0.081520,0.021757,...,0.000000,0.101220,0.017416,0.024360,0.017971,0.000000,0.000000,0.030446,0.018630,0.000000
3,0.0,0.000000,0.000000,0.001905,0.000000,0.001648,0.000000,0.000000,0.001356,0.000000,...,0.018482,0.021324,0.008690,0.000000,0.001793,0.001786,0.000000,0.015191,0.003718,0.004738
4,0.0,0.049791,0.051023,0.000000,0.055555,0.066457,0.000000,0.024795,0.000000,0.000000,...,0.000000,0.150903,0.046737,0.000000,0.000000,0.024017,0.068635,0.040851,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
288,0.0,0.087691,0.089863,0.022558,0.024461,0.000000,0.000000,0.021835,0.000000,0.000000,...,0.000000,0.079731,0.041157,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
289,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.024725,0.000000,0.000000,0.000000,...,0.020986,0.081558,0.015787,0.022082,0.000000,0.000000,0.000000,0.000000,0.016887,0.000000
290,0.0,0.000000,0.000000,0.019634,0.010645,0.008489,0.175316,0.004751,0.000000,0.000000,...,0.011904,0.028915,0.008955,0.006263,0.000000,0.013806,0.000000,0.007827,0.004790,0.030516
291,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.067955


Use a Complement Naive Bayes Classifier

In [67]:
# Import relevant class and function
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score

# Instantiate a ComplementNB classifier
baseline_model = ComplementNB()

# Evaluate the classifier on X_train_vectorized and y_train
# (cross_val_score defaults to 5-fold cross-validation)
baseline_cv = cross_val_score(baseline_model, X_train_vectorized, y_train)
baseline_cv.mean()

0.9281122150789013

In [68]:
# Transform X_test with the vectorizer that was fitted on X_train
X_test_vectorized = tfidf.transform(X_test)

# Visually inspect the vectorized data
# pd.DataFrame.sparse.from_spmatrix(X_test_vectorized, columns=tfidf.get_feature_names_out())

In [69]:
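# Cross-validate the classifier on X_test_vectorized and y_test
# (note: this refits the model on folds of the test split; it does not score
# a model trained on X_train against the held-out data)
baseline_cv = cross_val_score(baseline_model, X_test_vectorized, y_test)
baseline_cv.mean()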

0.9591466978375219
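
The score above comes from cross-validating within the test split. A more conventional hold-out check fits the model once on the training split and scores it on the untouched test split. The sketch below shows that pattern on hypothetical toy token lists; `identity`, `train_docs`, `test_docs`, and the labels are illustrative stand-ins for `dummy_fun`, `X_train`, `X_test`, `y_train`, and `y_test`, and the result is not comparable to the numbers in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB


def identity(tokens):
    # The documents are already token lists, so pass them through unchanged
    return tokens


# Hypothetical toy data standing in for the real train/test splits
train_docs = [["vote", "congress", "bill"], ["senate", "vote", "law"],
              ["goal", "team", "win"], ["team", "score", "match"]]
train_labels = ["Politics", "Politics", "Sports", "Sports"]
test_docs = [["congress", "law"], ["team", "goal"]]
test_labels = ["Politics", "Sports"]

vectorizer = TfidfVectorizer(analyzer="word", tokenizer=identity,
                             preprocessor=identity, token_pattern=None)
train_matrix = vectorizer.fit_transform(train_docs)  # fit on training data only
test_matrix = vectorizer.transform(test_docs)        # reuse the fitted vocabulary

# Fit once on the training split, then score on the held-out split
model = ComplementNB()
model.fit(train_matrix, train_labels)
predictions = model.predict(test_matrix)
holdout_accuracy = model.score(test_matrix, test_labels)

print(list(predictions), holdout_accuracy)
```

In the notebook this would correspond to calling `baseline_model.fit(X_train_vectorized, y_train)` followed by `baseline_model.score(X_test_vectorized, y_test)`.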