# Overview

The goal of this project is to perform a sentiment analysis of Apple customers and uncover actionable insights that could be used to optimize a marketing strategy. To achieve this, we built a predictive model using Natural Language Processing (NLP) that rates the sentiment of a tweet based on its content. At the end of our analysis, we present the findings of our model and provide concrete recommendations for how Apple could improve its marketing strategy and ultimately increase customer satisfaction.

# Business Understanding

Developing an excellent marketing strategy is crucial for an organization to consistently achieve positive results. To perform effective marketing, companies need to gain a deep understanding of their customers and uncover what matters to them most. The challenge is figuring out how to gain this insight in an efficient manner, and how to consistently implement meaningful change. Fortunately, machine learning provides us with unique and effective tools to perform customer sentiment analysis and guide long-term decision making.

# Data Understanding


For this analysis, I used 115,511 tweets from 587 Twitter accounts, pulled from the Twitter API. I manually selected these accounts to represent each account class that I am trying to predict.

## Imports / settings

In [1]:
# General imports
import string

# Twitter import
import tweepy

# Analysis imports
import pandas as pd
import numpy as np

# NLP imports
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer

# scikit-learn imports
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.dummy import DummyClassifier

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas settings
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 90

# Downloads (for NLP)
import nltk
nltk.download('wordnet')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger');

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Constants

In [2]:
tweet_list_file = 'tweet_list2.csv'

## Functions

These are helper functions that assist in the manipulation of tweet strings for pre-processing purposes.

In [3]:
def strip_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon+1:].lower()
    else:
        return text.lower()

def get_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return (user[at+1:]).lower()
    else:
        return ""

def addHashTags(text):
    return "#" + text + "#"

def get_wordnet_pos(treebank_tag):
    '''
    Translate an nltk (Penn Treebank) POS tag to the matching wordnet tag.
    Tags that are not adjectives, verbs, nouns, or adverbs default to noun.
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def remove_characters(text, char_to_remove):
    str1 = ''.join(x for x in text if not x in char_to_remove)
    return str1

def remove_punctuation(text):
    text = remove_characters(text, string.punctuation)
    return text

def tag_and_lemmatize(text):
    newText = text
    newText = pos_tag(newText)
    newText = [(x[0], get_wordnet_pos(x[1])) for x in newText]
    lemma = nltk.stem.WordNetLemmatizer()
    newText = [(lemma.lemmatize(x[0], x[1])) for x in newText]
    return newText

def dummy_fun(doc):
    return doc

# perform all pre-processing on a df
def preprocessing(df):
    preprocessing_01_model_specific(df)
    preprocessing_02_general(df)
    preprocessing_03_tag_and_lemmatize(df)
    
    
def preprocessing_01_model_specific(df):
    # Copy the RT user name from the text column and put it into a different column.
    df['RT_user'] = df['text'].apply(get_rt_user)
    df['RT_user'] = df['RT_user'].apply(lambda x: addHashTags(x) if x != "" else "")

    # Pull out the RT user name from the text column
    df['text'] = df['text'].apply(strip_rt_user)
    
def preprocessing_02_general(df):
    # Lower case the text tweets
    df['text'] = df['text'].str.lower()

    # Strip out the meaningless links
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if n[0:4] != "http"]))

    # Strip any excess white space
    df['text'] = df['text'].apply(lambda x: x.strip())
    
    # Take out stop words
    sw = set(stopwords.words('english'))
    sw.update(['amp'])
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if n not in sw]))

    # Remove punctuation
    df['text'] = df['text'].apply(lambda x: remove_punctuation(x))

    # Make sure we don't have any random numbers
    df['text'] = df['text'].apply(lambda x: " ".join([n for n in x.split() if n.isnumeric() == False]))

    # Put together the RT user and the tweet text
    df['text'] = df['text'] + " " + df['RT_user']

    # Make a new column, tokenize the words
    df['text_tokenized'] = df['text'].str.split()
    
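    # NOTE: the statements below rebind the local name `df`, so the dropped
    # columns, the removed empty rows, and the `class_label` column exist only
    # on a local copy; the caller's DataFrame keeps its original rows and columns.
    df = df.drop(columns=['id', 'author_id', 'created_at'])
    
    # Treat tweets that are empty after cleaning as missing and drop them
    df['text'] = df['text'].apply(lambda x: np.nan if len(x.strip()) == 0 else x)
    df = df.dropna().reset_index(drop=True) 

    # Encode the class names as integer labels
    le = LabelEncoder()
    df['class_label'] = le.fit_transform(df['class'])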
    
def preprocessing_03_tag_and_lemmatize(df):
    df['text_tokenized'] = df['text_tokenized'].apply(tag_and_lemmatize)
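
To make the retweet handling concrete, here is a small sketch that runs the three retweet helpers on a single tweet. The helper bodies repeat the definitions above so the block runs on its own; the sample tweet is made up for illustration.

```python
def strip_rt_user(text):
    # Drop the "RT @user:" prefix from a retweet and lower-case the text
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon + 1:].lower()
    return text.lower()

def get_rt_user(text):
    # Return the lower-cased retweeted user name, or "" for a non-retweet
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return user[at + 1:].lower()
    return ""

def addHashTags(text):
    # Wrap the user name in "#" so it becomes a distinct token
    return "#" + text + "#"

sample = "RT @JamesGunn: Up, up, and away!"  # made-up example tweet
rt_user = get_rt_user(sample)
body = strip_rt_user(sample)
token = addHashTags(rt_user) if rt_user != "" else ""

print(repr(rt_user))  # 'jamesgunn'
print(repr(body))     # ' up, up, and away!'
print(repr(token))    # '#jamesgunn#'
```

Wrapping the retweeted user name in `#` keeps it from colliding with an ordinary word that has the same spelling, which is why tokens such as `#jamesgunn#` show up in the tokenized text later on.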

## Data Collection

Data collection methods and code are located in a separate notebook ([here](notebook_02_data_collection.ipynb)).

## Load tweet data

Load the tweet data from file.

In [4]:
# Load tweets from file
df = pd.read_csv(tweet_list_file)

# Format all series as strings
for n in df.columns:
    df[n] = df[n].astype(str)

# Check out the data
df.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at
0,TeamPelosi,Politics - Liberal,1620527449326108672,"On this day 83 years ago, Democrats Delivered the first Social Security checks ever! ...",2461810448,2023-01-31 21:00:26+00:00
1,TeamPelosi,Politics - Liberal,1620131183597359104,We must keep our children safe from gun violence. Safe storage of guns saves lives and...,2461810448,2023-01-30 18:45:49+00:00
2,TeamPelosi,Politics - Liberal,1619445261784477696,Democrats believe that health care is a human right and #DemocratsDelivered help for ...,2461810448,2023-01-28 21:20:12+00:00
3,TeamPelosi,Politics - Liberal,1619183614050156544,Congratulations @PADems for your hard-won victories electing Pennsylvania Democrats wh...,2461810448,2023-01-28 04:00:31+00:00
4,TeamPelosi,Politics - Liberal,1619157193626116098,My heart goes out to Tyre Nichols mother and their entire family. Tyre should be alive...,2461810448,2023-01-28 02:15:32+00:00


## Data cleaning

**Check for nulls**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84908 entries, 0 to 84907
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_name   84908 non-null  object
 1   class       84908 non-null  object
 2   id          84908 non-null  object
 3   text        84908 non-null  object
 4   author_id   84908 non-null  object
 5   created_at  84908 non-null  object
dtypes: object(6)
memory usage: 3.9+ MB


Notes:
- There are no null values, which makes sense because I downloaded this data myself. 

**Check for duplicates**

In [6]:
df.duplicated().sum()

803

Notes:
- I have some duplicate tweets. As I noted in the data collection notebook, I must have downloaded tweets from some accounts more than once while running the download function.

**Drop duplicates**

In [7]:
df = df.drop_duplicates()
df.duplicated().sum()

0

Notes:
- Duplicates have been deleted.

## Data review

Check class balance at the tweet level

In [8]:
df['class'].value_counts()

Politics - Conservative    15500
TV / movies                12007
Politics - Liberal         12001
Sports                     12000
Music                      11600
Business and finance        8452
Science / Technology        7550
Travel                      4995
Name: class, dtype: int64

Notes: 
- The classes are imbalanced, but I'm going to leave them as they are and see if we can still make predictions from the data we have.

# Modeling

## Model 1 - Predict primary interest of account from 6 major categories

### Pre-processing 

**Warning** This code performs all pre-processing, including lemmatization of the tweet text.  As such, it takes a few minutes to run.  

In [36]:
# Make a copy of the df, leave the original untouched
df_pp = df.copy()
preprocessing(df_pp)
df_pp

Unnamed: 0,user_name,class,id,text,author_id,created_at,RT_user,text_tokenized
0,TeamPelosi,Politics - Liberal,1620527449326108672,day years ago democrats delivered first social security checks ever republicans social...,2461810448,2023-01-31 21:00:26+00:00,,"[day, year, ago, democrat, deliver, first, social, security, check, ever, republicans,..."
1,TeamPelosi,Politics - Liberal,1620131183597359104,must keep children safe gun violence safe storage guns saves lives heartbreak thats ho...,2461810448,2023-01-30 18:45:49+00:00,,"[must, keep, child, safe, gun, violence, safe, storage, gun, save, life, heartbreak, t..."
2,TeamPelosi,Politics - Liberal,1619445261784477696,democrats believe health care human right democratsdelivered help families get coverag...,2461810448,2023-01-28 21:20:12+00:00,,"[democrat, believe, health, care, human, right, democratsdelivered, help, family, get,..."
3,TeamPelosi,Politics - Liberal,1619183614050156544,congratulations padems hardwon victories electing pennsylvania democrats committed gro...,2461810448,2023-01-28 04:00:31+00:00,,"[congratulation, padems, hardwon, victory, elect, pennsylvania, democrat, commit, grow..."
4,TeamPelosi,Politics - Liberal,1619157193626116098,heart goes tyre nichols mother entire family tyre alive today justice must done must r...,2461810448,2023-01-28 02:15:32+00:00,,"[heart, go, tyre, nichols, mother, entire, family, tyre, alive, today, justice, must, ..."
...,...,...,...,...,...,...,...,...
84903,BBC_Travel,Travel,1317805131052634115,pictureperfect medieval city may show us live better,173992307,2020-10-18 12:30:01+00:00,,"[pictureperfect, medieval, city, may, show, u, live, good]"
84904,BBC_Travel,Travel,1317586183845605377,stunning alpine landscape lead nietzsche proclaim god dead,173992307,2020-10-17 22:00:00+00:00,,"[stun, alpine, landscape, lead, nietzsche, proclaim, god, dead]"
84905,BBC_Travel,Travel,1317442744374079491,writing system world used exclusively women,173992307,2020-10-17 12:30:01+00:00,,"[write, system, world, use, exclusively, woman]"
84906,BBC_Travel,Travel,1317223800035971072,steak spaghetti ketchup postwar japan yoshoku become countrys defacto comfort food act...,173992307,2020-10-16 22:00:01+00:00,,"[steak, spaghetti, ketchup, postwar, japan, yoshoku, become, countrys, defacto, comfor..."


Make sure there are no nulls after processing

In [116]:
df_pp.isna().sum()

user_name         0
class             0
id                0
text              0
author_id         0
created_at        0
RT_user           0
text_tokenized    0
dtype: int64

First, let's try to predict the primary interest of the user among our main classifications:
- Politics
- Sports
- Entertainment
- Business and finance
- Travel
- Science / Technology

In [39]:
df_pp.loc[(df_pp['class'] == 'Politics - Conservative') | (df_pp['class'] == 'Politics - Liberal'), 'class'] = 'Politics'
df_pp.loc[(df_pp['class'] == 'Music') | (df_pp['class'] == 'TV / movies'), 'class'] = 'Entertainment'
df_pp.loc[(df_pp['class'] == 'Business and finance'), 'class'] = 'Business'
df_pp['class'].value_counts()

Politics                27501
Entertainment           23607
Sports                  12000
Business                 8452
Science / Technology     7550
Travel                   4995
Name: class, dtype: int64

Aggregate all text words by account

In [40]:
df_model = df_pp.groupby(['user_name', 'class']).agg({'text_tokenized': 'sum'}).reset_index()
df_model

Unnamed: 0,user_name,class,text_tokenized
0,20thcentury,Entertainment,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy..."
1,9to5mac,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf..."
2,ABCNetwork,Entertainment,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,..."
3,AOC,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ..."
4,Acyn,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo..."
...,...,...,...
150,thedailybeast,Politics,"[favorite, part, jimmykimmel, ask, first, guest, pamela, anderson, ever, meet, mypillo..."
151,travelchannel,Travel,"[late, episode, kindredspirits, u, like, miss, it, stream, discoveryplus, amybruni, ad..."
152,travelocity,Travel,"[sometimes, hard, part, travel, start, pack, process, weve, do, hard, part, ya, tag, f..."
153,wbpictures,Entertainment,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu..."
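
The `groupby(...).agg({'text_tokenized': 'sum'})` call works because adding Python lists concatenates them, so summing a column of token lists yields one long list per account. A minimal plain-Python sketch of the same idea, using made-up rows and account names:

```python
from collections import defaultdict

# Made-up rows: (user_name, class, tokens of one tweet)
rows = [
    ("acct_a", "Sports", ["great", "game"]),
    ("acct_b", "Travel", ["visit", "rome"]),
    ("acct_a", "Sports", ["final", "score"]),
]

# Concatenate the token lists of all tweets that share (user_name, class)
tokens_by_account = defaultdict(list)
for user_name, cls, tokens in rows:
    tokens_by_account[(user_name, cls)].extend(tokens)

for key in sorted(tokens_by_account):
    print(key, tokens_by_account[key])
```

After this step each account is a single document, which is why the modeling data has one row per account rather than one row per tweet.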


In [41]:
df_model['class'].value_counts()

Politics                56
Entertainment           50
Sports                  24
Business                10
Science / Technology     9
Travel                   6
Name: class, dtype: int64

In [42]:
df_model['count_words'] = df_model['text_tokenized'].apply(len)
df_model

Unnamed: 0,user_name,class,text_tokenized,count_words
0,20thcentury,Entertainment,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy...",6348
1,9to5mac,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf...",8886
2,ABCNetwork,Entertainment,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,...",5739
3,AOC,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ...",7371
4,Acyn,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo...",5426
...,...,...,...,...
150,thedailybeast,Politics,"[favorite, part, jimmykimmel, ask, first, guest, pamela, anderson, ever, meet, mypillo...",7388
151,travelchannel,Travel,"[late, episode, kindredspirits, u, like, miss, it, stream, discoveryplus, amybruni, ad...",7620
152,travelocity,Travel,"[sometimes, hard, part, travel, start, pack, process, weve, do, hard, part, ya, tag, f...",10614
153,wbpictures,Entertainment,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu...",6226


In [43]:
df_model['class'].value_counts(normalize=True)

Politics                0.361290
Entertainment           0.322581
Sports                  0.154839
Business                0.064516
Science / Technology    0.058065
Travel                  0.038710
Name: class, dtype: float64

Aggregate word count by class

In [44]:
df_model_by_class = df_model.groupby(['class']).agg({'count_words': 'sum'}).reset_index()
df_model_by_class

Unnamed: 0,class,count_words
0,Business,110475
1,Entertainment,232995
2,Politics,370771
3,Science / Technology,109941
4,Sports,107750
5,Travel,64818


### Train-test-split

In [90]:
X = df_model['text_tokenized']
y = df_model['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.35, stratify=y)
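
The sketch below illustrates what `stratify=y` does with class counts like the ones in this dataset. The features are placeholders, and `random_state=42` is an assumption added only to make the sketch repeatable; the call above sets no seed, so its split changes from run to run.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Account-level class counts, copied from the value_counts() output earlier
class_counts = {
    "Politics": 56, "Entertainment": 50, "Sports": 24,
    "Business": 10, "Science / Technology": 9, "Travel": 6,
}
y = [cls for cls, n in class_counts.items() for _ in range(n)]
X = list(range(len(y)))  # placeholder features, one per account

# random_state is an assumption for reproducibility, not part of the original call
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.35, stratify=y, random_state=42
)

print(len(y_train), len(y_test))
print(Counter(y_test))
```

Stratifying keeps each class at roughly the same proportion in both splits, which matters here because the smallest class has only 6 accounts.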

### Dummy Classifier

Use Dummy Classifier to predict most frequent label

In [91]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)
dummy_clf.predict(X)[0]

'Politics'

Get accuracy of the dummy classifier

In [92]:
dummy_clf.score(X, y)

0.36129032258064514
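
The baseline accuracy can be checked by hand: a classifier that always predicts the most frequent class is right exactly as often as that class occurs. A short sketch using the account-level counts shown earlier:

```python
from collections import Counter

# Account-level class counts, copied from the value_counts() output earlier
class_counts = Counter({
    "Politics": 56, "Entertainment": 50, "Sports": 24,
    "Business": 10, "Science / Technology": 9, "Travel": 6,
})

# The most frequent class and its share of all accounts
label, n = class_counts.most_common(1)[0]
accuracy = n / sum(class_counts.values())

print(label, accuracy)  # Politics, 56 / 155
```

Any model worth keeping needs to beat this share of roughly 36%.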

### Complement Naive Bayes Classifier

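Use TfidfVectorizer to vectorize the tweet text

In [110]:
# Instantiate a vectorizer limited to the 750 most frequent n-grams
# (unigrams to trigrams) that appear in at least 2 accounts.
# The text is already tokenized and lemmatized, so the tokenizer and
# preprocessor are pass-through functions that return each token list unchanged.
tfidf = TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, 
                        preprocessor=dummy_fun, token_pattern=None, 
                        ngram_range=(1,3), min_df=2, max_features=750)

# Fit the vectorizer on X_train and transform it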
X_train_vectorized = tfidf.fit_transform(X_train)
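
To show how a vectorizer handles documents that are already lists of tokens, here is a small sketch with made-up documents. The `identity` function plays the same role as `dummy_fun`; the documents and the shorter `ngram_range=(1, 2)` are assumptions chosen to keep the vocabulary small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    # Pass the token list through unchanged
    return doc

# Made-up, pre-tokenized documents
docs = [
    ["vote", "senate", "bill", "vote"],
    ["game", "score", "win"],
    ["senate", "vote", "win"],
]

vectorizer = TfidfVectorizer(analyzer="word", tokenizer=identity,
                             preprocessor=identity, token_pattern=None,
                             ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)                    # (documents, distinct n-grams)
print(sorted(vectorizer.vocabulary_))  # unigrams and space-joined bigrams
```

Because the n-grams are built from the token lists, a bigram such as `vote senate` is simply two adjacent tokens joined by a space.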

Use a Complement Naive Bayes Classifier

In [111]:
# ComplementNB and cross_val_score were imported at the top of the notebook

# Instantiate a ComplementNB classifier
baseline_model = ComplementNB()

# Evaluate the classifier on X_train_vectorized and y_train
baseline_cv = cross_val_score(baseline_model, X_train_vectorized, y_train)
baseline_cv.mean()



0.95

In [112]:
# Transform X_test with the vectorizer that was fitted on the training data
X_test_vectorized = tfidf.transform(X_test)

In [113]:
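# Cross-validate the classifier on X_test_vectorized and y_test.
# This fits fresh models on folds of the test split; it does not score
# a model that was trained on the training split.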
baseline_cv = cross_val_score(baseline_model, X_test_vectorized, y_test)
baseline_cv.mean()



0.7818181818181819
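
An alternative way to measure generalization is to fit once on the training split and score that fitted model on the held-out split. The sketch below does this with a `Pipeline`, so the vectorizer and classifier are always fitted together; the tiny two-class dataset and the `identity` helper are made up for illustration, and this is not the evaluation that produced the numbers above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

def identity(doc):
    # Pass the token list through unchanged
    return doc

# Made-up, pre-tokenized training and test documents
train_docs = [
    ["vote", "senate", "bill"], ["election", "vote", "senate"],
    ["game", "score", "win"], ["team", "game", "win"],
]
train_labels = ["Politics", "Politics", "Sports", "Sports"]
test_docs = [["senate", "election"], ["team", "score"]]
test_labels = ["Politics", "Sports"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", tokenizer=identity,
                              preprocessor=identity, token_pattern=None)),
    ("nb", ComplementNB()),
])

# Fit on the training documents only, then score on unseen documents
model.fit(train_docs, train_labels)
predictions = model.predict(test_docs)
accuracy = model.score(test_docs, test_labels)

print(list(predictions), accuracy)
```

With this setup the test documents are encoded with the vocabulary and IDF weights learned from the training documents, so the score reflects how the trained model behaves on accounts it has not seen.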

In [114]:
import os

# Read the bearer token from an environment variable instead of hard-coding the
# secret in the notebook (the variable name TWITTER_BEARER_TOKEN is an assumption)
bearer_key = os.environ['TWITTER_BEARER_TOKEN']

def get_tweets(username, class_, number_of_tweets):
    # Create the API client with the bearer token and look up the user's id
    client = tweepy.Client(bearer_token=bearer_key)
    user_id = client.get_user(username=username).data.id

    # Use the paginator to request as many tweets as we want
    # (the paginator makes it possible to download more than 100 at a time)
    tweets = []
    paginator = tweepy.Paginator(client.get_users_tweets, user_id,
                                 tweet_fields=['created_at', 'author_id'], expansions=[''],
                                 max_results=100, exclude=['replies'])
    for tweet in paginator.flatten(limit=number_of_tweets):
        # Scrub the text of any non-readable characters
        text = "".join(i for i in tweet.text if i in string.printable)
        # Scrub the text of any newlines
        text = text.replace("\n", " ")
        # Put the tweet info into a new dictionary
        tweets.append({
            "user_name"  : str(username),
            "class"      : str(class_),
            "id"         : str(tweet.id),
            "text"       : str(text),
            "author_id"  : str(tweet.author_id),
            "created_at" : str(tweet.created_at)
        })
    return tweets


In [115]:
username = 'cryptocom'
if username in df['user_name'].unique():
    print("Error:  User name is in the original dataset. Test a different user.")
else:
    tweets = get_tweets(username, 'unknown', 50)
    if len(tweets) > 0:
        df_new = pd.DataFrame.from_dict(tweets)
        preprocessing(df_new)
        df_new = df_new.groupby(['user_name', 'class']).agg({'text_tokenized': 'sum'}).reset_index()

        # Fit the classifier on the vectorized training data
        baseline_model = ComplementNB()
        baseline_model.fit(X_train_vectorized, y_train)

        # Reuse the vectorizer fitted on the training data so the new account
        # is encoded with the same vocabulary and IDF weights as the training set
        df_new_vectorized = tfidf.transform(df_new['text_tokenized'])

        category1 = baseline_model.predict(df_new_vectorized)[0]
        print(username, ":  ", category1, sep="")
    else:
        print('Tweets were not returned.')

cryptocom:  Entertainment


## Model 2 - Political affiliation

### Pre-processing 

**Warning** This code performs all pre-processing, including lemmatization of the tweet text.  As such, it takes a few minutes to run.  

In [9]:
# Make a copy of the df, leave the original untouched
df_pp = df.copy()
preprocessing(df_pp)
df_pp

KeyboardInterrupt: 

Make sure there are no nulls after processing

In [116]:
df_pp.isna().sum()

user_name         0
class             0
id                0
text              0
author_id         0
created_at        0
RT_user           0
text_tokenized    0
dtype: int64

For this model, keep only the tweets from the two political classes:
- Politics - Conservative
- Politics - Liberal

In [117]:
# Keep every column for the rows that belong to the two political classes
df_pp = df_pp.loc[df_pp['class'].isin(['Politics - Conservative', 'Politics - Liberal'])]
df_pp['class'].value_counts()

