# Overview

The goal of this project is to perform a sentiment analysis of Apple customers and uncover actionable insights that could be used to optimize a marketing strategy. To achieve this, we built a predictive model using Natural Language Processing (NLP) that could rate the sentiment of a tweet based on its content. At the end of our analysis, we present the findings of our model and provide concrete recommendations for how Apple could improve its marketing strategy and ultimately increase customer satisfaction.

# Business Understanding

Developing an excellent marketing strategy is crucial for an organization to consistently achieve positive results. To perform effective marketing, companies need to gain a deep understanding of their customers and uncover what matters to them most. The challenge is figuring out how to gain this insight in an efficient manner, and how to consistently implement meaningful change. Fortunately, machine learning provides us with unique and effective tools to perform customer sentiment analysis and guide long-term decision making.

# Data Understanding


For this analysis, I used 115,511 tweets from 587 Twitter accounts, pulled from the Twitter API. I manually selected these accounts to represent each account class that I am trying to predict.

## Imports / settings

In [130]:
# General imports
import string

# Analysis imports
import pandas as pd
import numpy as np

# NLP imports
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer

# SKlearn imports
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas settings
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 90

# Downloads (for NLP)
nltk.download('wordnet')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger');

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\natek\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Functions

These are helper functions that assist in the manipulation of tweet strings for pre-processing purposes.

In [55]:
def strip_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon+1:].lower()
    else:
        return text.lower()

def get_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return (user[at+1:]).lower()
    else:
        return ""

def addHashTags(text):
    return "#" + text + "#"

# Translate nltk POS to wordnet tags
def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def remove_characters(text, char_to_remove):
    # Keep every character that is not in char_to_remove
    return ''.join(x for x in text if x not in char_to_remove)

def remove_punctuation(text):
    text = remove_characters(text, string.punctuation)
    return text
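
As a quick check of the retweet helpers above, the sketch below runs them on a made-up retweet string. The sample text is an assumption for illustration and is not taken from the dataset. The helper definitions are repeated so the snippet runs on its own.

```python
# Helper definitions repeated from above so this snippet is self-contained
def strip_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        return text[colon+1:].lower()
    else:
        return text.lower()

def get_rt_user(text):
    if text[0:2] == "RT":
        colon = text.find(":")
        user = text[:colon]
        at = user.find("@")
        return (user[at+1:]).lower()
    else:
        return ""

def addHashTags(text):
    return "#" + text + "#"

# Hypothetical retweet, made up for illustration
sample = "RT @SomeUser: Big news today!"

rt_user = get_rt_user(sample)      # retweeted account, lowercased
rt_token = addHashTags(rt_user)    # account wrapped in '#' so it becomes a distinct token
body = strip_rt_user(sample)       # tweet text without the 'RT @user:' prefix

print(rt_user, rt_token, repr(body))
```

Note that `strip_rt_user` keeps the space that follows the colon, so the returned text starts with a space until it is stripped in a later step.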

## Data Collection

Data collection methods and code are located in a separate notebook ([here](notebook_02_data_collection.ipynb)).

## Load tweet data

Load the tweet data from file.

In [93]:
# Load tweets from file
df = pd.read_csv('tweet_list.csv')

# Format all series as strings
for n in df.columns:
    df[n] = df[n].astype(str)

# Check out the data
df.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at
0,BennieGThompson,Politics - Liberal,1620584010991939584,"Today marks the 83rd anniversary of the first ever #SocialSecurity check, and Republic...",82453460,2023-02-01 00:45:11+00:00
1,BennieGThompson,Politics - Liberal,1620116251749269511,RT @VP: President Biden and I are just getting started. https://t.co/gLmNbpKGAN,82453460,2023-01-30 17:46:29+00:00
2,BennieGThompson,Politics - Liberal,1620116182618759168,"RT @RepJeffries: We will never negotiate away the health, safety or economic well-bein...",82453460,2023-01-30 17:46:12+00:00
3,BennieGThompson,Politics - Liberal,1620116109864357888,https://t.co/Ze7ePCUJJ2,82453460,2023-01-30 17:45:55+00:00
4,BennieGThompson,Politics - Liberal,1620061909113516036,https://t.co/ley5hNsz0y https://t.co/RFdTeGXGO1,82453460,2023-01-30 14:10:33+00:00


## Data cleaning

**Check for nulls**

In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115511 entries, 0 to 115510
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user_name   115511 non-null  object
 1   class       115511 non-null  object
 2   id          115511 non-null  object
 3   text        115511 non-null  object
 4   author_id   115511 non-null  object
 5   created_at  115511 non-null  object
dtypes: object(6)
memory usage: 5.3+ MB


Notes:
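- There are no null values, which makes sense because I downloaded this data myself. Note that every column was cast to `str` on load, so a missing value would show up as the string `'nan'` rather than as a null.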

**Check for duplicates**

In [95]:
df.duplicated().sum()

877

Notes:
- There are 877 duplicate rows. As I noted in the data collection notebook, I must have downloaded tweets from some accounts more than once while running the download function.

**Drop duplicates**

In [96]:
df = df.drop_duplicates()
df.duplicated().sum()

0

Notes:
- Duplicates have been deleted.

## Data review

Check class balance at the tweet level

In [97]:
df['class'].value_counts()

Politics - Conservative    31032
Politics - Liberal         26998
TV / movies                12007
Sports                     12000
Music                      11600
Business and finance        8452
Science / Technology        7550
Travel                      4995
Name: class, dtype: int64

Notes: 
- The classes are imbalanced, but I'm going to leave them as they are and see whether we can still make predictions from the data we have.

# Modeling

## Pre-processing 

In [98]:
# Make a copy of the df, leave the original untouched
df_pp = df.copy()

In [99]:
# Constants 
min_word_size = 0
min_word_count = 0

**Retweets**

If someone retweets another user, it is likely that the retweeted account belongs to the same class. For example, politicians often retweet other politicians they agree with. To capture this similarity, I mark the retweeted account name as special (by surrounding it with hashtags) and append it to the tweet text so that the model can use it as a token.

In [100]:
# Copy the RT user name from the text column and put it into a different column.
df_pp['RT_user'] = df_pp['text'].apply(get_rt_user)
df_pp['RT_user'] = df_pp['RT_user'].apply(lambda x: addHashTags(x) if x != "" else "")

# Pull out the RT user name from the text column
df_pp['text'] = df_pp['text'].apply(strip_rt_user)
df_pp.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at,RT_user
0,BennieGThompson,Politics - Liberal,1620584010991939584,"today marks the 83rd anniversary of the first ever #socialsecurity check, and republic...",82453460,2023-02-01 00:45:11+00:00,
1,BennieGThompson,Politics - Liberal,1620116251749269511,president biden and i are just getting started. https://t.co/glmnbpkgan,82453460,2023-01-30 17:46:29+00:00,#vp#
2,BennieGThompson,Politics - Liberal,1620116182618759168,"we will never negotiate away the health, safety or economic well-being of the america...",82453460,2023-01-30 17:46:12+00:00,#repjeffries#
3,BennieGThompson,Politics - Liberal,1620116109864357888,https://t.co/ze7epcujj2,82453460,2023-01-30 17:45:55+00:00,
4,BennieGThompson,Politics - Liberal,1620061909113516036,https://t.co/ley5hnsz0y https://t.co/rfdtegxgo1,82453460,2023-01-30 14:10:33+00:00,


Check new RT user column

In [101]:
df_pp.RT_user.value_counts()

                     91126
#jesseprimetime#       248
#dineshdsouza#         228
#housegop#             219
#foxnews#              193
                     ...  
#calpine#                1
#cryptosavingexp#        1
#matthewbevan#           1
#newschannelnine#        1
#repjohnjoyce#           1
Name: RT_user, Length: 7885, dtype: int64

Notes:
- Successfully extracted the RT user names from the text column. The 91,126 rows with an empty value are tweets that are not retweets.

Lowercase the text, strip out links, and strip any excess whitespace

In [102]:
# Lower case the text tweets
df_pp['text'] = df_pp['text'].str.lower()

# Strip out the meaningless links
df_pp['text'] = df_pp['text'].apply(lambda x: " ".join([n for n in x.split() if n[0:4] != "http"]))

# Strip any excess white space
df_pp['text'] = df_pp['text'].apply(lambda x: x.strip())

df_pp.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at,RT_user
0,BennieGThompson,Politics - Liberal,1620584010991939584,"today marks the 83rd anniversary of the first ever #socialsecurity check, and republic...",82453460,2023-02-01 00:45:11+00:00,
1,BennieGThompson,Politics - Liberal,1620116251749269511,president biden and i are just getting started.,82453460,2023-01-30 17:46:29+00:00,#vp#
2,BennieGThompson,Politics - Liberal,1620116182618759168,"we will never negotiate away the health, safety or economic well-being of the american...",82453460,2023-01-30 17:46:12+00:00,#repjeffries#
3,BennieGThompson,Politics - Liberal,1620116109864357888,,82453460,2023-01-30 17:45:55+00:00,
4,BennieGThompson,Politics - Liberal,1620061909113516036,,82453460,2023-01-30 14:10:33+00:00,


Take out stop words, punctuation and numbers.  Then put RT user into text column.

In [113]:
# Take out stop words
sw = set(stopwords.words('english'))
sw.update(['amp'])
df_pp['text'] = df_pp['text'].apply(lambda x: " ".join([n for n in x.split() if n not in sw]))

# Remove punctuation
df_pp['text'] = df_pp['text'].apply(remove_punctuation)

# Make sure we don't have any standalone numbers
df_pp['text'] = df_pp['text'].apply(lambda x: " ".join([n for n in x.split() if not n.isnumeric()]))

# Append the RT user token and the author's user name token to the tweet text
df_pp['text'] = df_pp['text'] + " " + df_pp['RT_user'] + " #" + df_pp['user_name'] + "#"

df_pp.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at,RT_user,text_tokenized
0,BennieGThompson,Politics - Liberal,1620584010991939584,today marks 83rd anniversary first ever socialsecurity check republicans celebrating t...,82453460,2023-02-01 00:45:11+00:00,,"[today, marks, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cel..."
1,BennieGThompson,Politics - Liberal,1620116251749269511,president biden getting started vp vpBennieGThompson #vp# #BennieGThompson#,82453460,2023-01-30 17:46:29+00:00,#vp#,"[president, biden, getting, started, vp, #vp#BennieGThompson]"
2,BennieGThompson,Politics - Liberal,1620116182618759168,never negotiate away health safety economic wellbeing american people repjeffries repj...,82453460,2023-01-30 17:46:12+00:00,#repjeffries#,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, repjef..."
3,BennieGThompson,Politics - Liberal,1619330126361300993,footage tyre nichols killing painful send condolences family friends justice must serv...,82453460,2023-01-28 13:42:42+00:00,,"[footage, tyre, nichols, killing, painful, send, condolences, family, friends, justice..."
4,BennieGThompson,Politics - Liberal,1619327606159179777,happy birthday chairman congressman benniegthompson 30th yr congress celebrate incredi...,82453460,2023-01-28 13:32:41+00:00,#cbcinstitute#,"[happy, birthday, chairman, congressman, benniegthompson, 30th, yr, congress, celebrat..."
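
The cleaning steps above (drop links, drop stop words, remove punctuation, drop numeric tokens) can be sketched on a single string as follows. This is a minimal sketch, not the notebook's exact code: `clean_tweet` is a hypothetical helper, and the small `STOP_WORDS` set is an assumption that stands in for NLTK's English stop-word list plus `'amp'`.

```python
import string

# Small illustrative stop-word set (assumption); the notebook uses NLTK's English list plus 'amp'
STOP_WORDS = {"the", "and", "i", "are", "just", "amp"}

def clean_tweet(text, stop_words=STOP_WORDS):
    # Lowercase and drop link tokens
    tokens = [t for t in text.lower().split() if not t.startswith("http")]
    # Drop stop words (done before punctuation removal, as in the cells above)
    tokens = [t for t in tokens if t not in stop_words]
    # Remove punctuation characters from what is left
    no_punct = "".join(ch for ch in " ".join(tokens) if ch not in string.punctuation)
    # Drop purely numeric tokens
    return " ".join(t for t in no_punct.split() if not t.isnumeric())

cleaned = clean_tweet("President Biden and I are just getting started. https://t.co/gLmNbpKGAN")
print(cleaned)
```

Because stop words are filtered before punctuation is removed, a stop word with punctuation attached (for example `and,`) is not matched and survives as `and`. Swapping the two steps would catch those cases, at the cost of changing the results shown in this notebook.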


Tokenize the tweet text

In [114]:
# Make a new column, tokenize the words
df_pp['text_tokenized'] = df_pp['text'].str.split()

df_pp.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at,RT_user,text_tokenized
0,BennieGThompson,Politics - Liberal,1620584010991939584,today marks 83rd anniversary first ever socialsecurity check republicans celebrating t...,82453460,2023-02-01 00:45:11+00:00,,"[today, marks, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cel..."
1,BennieGThompson,Politics - Liberal,1620116251749269511,president biden getting started vp vpBennieGThompson #vp# #BennieGThompson#,82453460,2023-01-30 17:46:29+00:00,#vp#,"[president, biden, getting, started, vp, vpBennieGThompson, #vp#, #BennieGThompson#]"
2,BennieGThompson,Politics - Liberal,1620116182618759168,never negotiate away health safety economic wellbeing american people repjeffries repj...,82453460,2023-01-30 17:46:12+00:00,#repjeffries#,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, repjef..."
3,BennieGThompson,Politics - Liberal,1619330126361300993,footage tyre nichols killing painful send condolences family friends justice must serv...,82453460,2023-01-28 13:42:42+00:00,,"[footage, tyre, nichols, killing, painful, send, condolences, family, friends, justice..."
4,BennieGThompson,Politics - Liberal,1619327606159179777,happy birthday chairman congressman benniegthompson 30th yr congress celebrate incredi...,82453460,2023-01-28 13:32:41+00:00,#cbcinstitute#,"[happy, birthday, chairman, congressman, benniegthompson, 30th, yr, congress, celebrat..."


Notes:
- After taking away some of the words and links, there are some tweets with no text data.  Since there are no words to use in the model, I'll delete those tweets.

Delete tweets with no words after pre-processing

In [125]:
df_pp['text'] = df_pp['text'].apply(lambda x: np.nan if len(x.strip()) == 0 else x)
df_pp = df_pp.dropna().reset_index(drop=True) 

df_pp.head()

Unnamed: 0,user_name,class,id,text,author_id,created_at,RT_user,text_tokenized
0,BennieGThompson,Politics - Liberal,1620584010991939584,today marks 83rd anniversary first ever socialsecurity check republicans celebrating t...,82453460,2023-02-01 00:45:11+00:00,,"[today, marks, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cel..."
1,BennieGThompson,Politics - Liberal,1620116251749269511,president biden getting started vp vpBennieGThompson #vp# #BennieGThompson#,82453460,2023-01-30 17:46:29+00:00,#vp#,"[president, biden, getting, started, vp, vpBennieGThompson, #vp#, #BennieGThompson#]"
2,BennieGThompson,Politics - Liberal,1620116182618759168,never negotiate away health safety economic wellbeing american people repjeffries repj...,82453460,2023-01-30 17:46:12+00:00,#repjeffries#,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, repjef..."
3,BennieGThompson,Politics - Liberal,1619330126361300993,footage tyre nichols killing painful send condolences family friends justice must serv...,82453460,2023-01-28 13:42:42+00:00,,"[footage, tyre, nichols, killing, painful, send, condolences, family, friends, justice..."
4,BennieGThompson,Politics - Liberal,1619327606159179777,happy birthday chairman congressman benniegthompson 30th yr congress celebrate incredi...,82453460,2023-01-28 13:32:41+00:00,#cbcinstitute#,"[happy, birthday, chairman, congressman, benniegthompson, 30th, yr, congress, celebrat..."


Make sure there are no nulls left after dropping the empty tweets

In [126]:
df_pp.isna().sum()

user_name         0
class             0
id                0
text              0
author_id         0
created_at        0
RT_user           0
text_tokenized    0
dtype: int64

In [127]:
# Keep only the columns needed for modeling
df_pp = df_pp.drop(columns=['id', 'user_name', 'author_id', 'created_at', 'RT_user'])
df_pp.head()

Unnamed: 0,class,text,text_tokenized
0,Politics - Liberal,today marks 83rd anniversary first ever socialsecurity check republicans celebrating t...,"[today, marks, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cel..."
1,Politics - Liberal,president biden getting started vp vpBennieGThompson #vp# #BennieGThompson#,"[president, biden, getting, started, vp, vpBennieGThompson, #vp#, #BennieGThompson#]"
2,Politics - Liberal,never negotiate away health safety economic wellbeing american people repjeffries repj...,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, repjef..."
3,Politics - Liberal,footage tyre nichols killing painful send condolences family friends justice must serv...,"[footage, tyre, nichols, killing, painful, send, condolences, family, friends, justice..."
4,Politics - Liberal,happy birthday chairman congressman benniegthompson 30th yr congress celebrate incredi...,"[happy, birthday, chairman, congressman, benniegthompson, 30th, yr, congress, celebrat..."


First, let's try to predict the primary interest of the user between our main classifications:
- Politics
- Sports
- TV / movies
- Business and finance
- Music
- Travel
- Science / Technology

In [128]:
# df3.loc[(df3['class'] == 'Politics - Conservative') | (df3['class'] == 'Politics - Liberal'), 'class'] = 'Politics'
df_pp['class'].value_counts()

Politics - Conservative    30532
Politics - Liberal         26846
TV / movies                11843
Sports                     11633
Music                      10916
Business and finance        8444
Science / Technology        7522
Travel                      4991
Name: class, dtype: int64

Encode the class labels as integers

In [131]:
le = LabelEncoder()
df_pp['class_label'] = le.fit_transform(df_pp['class'])
df_pp.head()

Unnamed: 0,class,text,text_tokenized,class_label
0,Politics - Liberal,today marks 83rd anniversary first ever socialsecurity check republicans celebrating t...,"[today, marks, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cel...",3
1,Politics - Liberal,president biden getting started vp vpBennieGThompson #vp# #BennieGThompson#,"[president, biden, getting, started, vp, vpBennieGThompson, #vp#, #BennieGThompson#]",3
2,Politics - Liberal,never negotiate away health safety economic wellbeing american people repjeffries repj...,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, repjef...",3
3,Politics - Liberal,footage tyre nichols killing painful send condolences family friends justice must serv...,"[footage, tyre, nichols, killing, painful, send, condolences, family, friends, justice...",3
4,Politics - Liberal,happy birthday chairman congressman benniegthompson 30th yr congress celebrate incredi...,"[happy, birthday, chairman, congressman, benniegthompson, 30th, yr, congress, celebrat...",3


In [139]:
labels_df = df_pp.groupby(['class', 'class_label'])[['class']].count()
labels_df

| class | class_label | count |
|---|---|---|
| Business and finance | 0 | 8444 |
| Music | 1 | 10916 |
| Politics - Conservative | 2 | 30532 |
| Politics - Liberal | 3 | 26846 |
| Science / Technology | 4 | 7522 |
| Sports | 5 | 11633 |
| TV / movies | 6 | 11843 |
| Travel | 7 | 4991 |
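
`LabelEncoder` assigns integers in sorted order of the class names, which is why the labels above follow alphabetical order. The snippet below is a pure-Python sketch of the same mapping (it mirrors the ordering, it is not scikit-learn's implementation) and can be handy for translating predicted integers back into class names.

```python
# Class names as they appear in the tweet data
classes = [
    "Politics - Conservative", "Politics - Liberal", "TV / movies", "Sports",
    "Music", "Business and finance", "Science / Technology", "Travel",
]

# Sorted unique names -> consecutive integers, mirroring LabelEncoder's ordering
label_to_id = {name: i for i, name in enumerate(sorted(set(classes)))}
id_to_label = {i: name for name, i in label_to_id.items()}

print(label_to_id["Politics - Liberal"])  # 3
print(id_to_label[7])                     # Travel
```

With the fitted encoder from the cell above, `le.inverse_transform` performs the same reverse lookup.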


In [211]:
def tag_and_lemmatize(tokens):
    '''
    POS-tag a list of tokens, then lemmatize each token using its WordNet POS
    '''
    tagged = pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]


In [212]:
df3['text4'] = df3['text3'].apply(tag_and_lemmatize)

In [213]:
df3['text4'][0]

['today',
 'mark',
 '83rd',
 'anniversary',
 'first',
 'ever',
 'socialsecurity',
 'check',
 'republicans',
 'celebrate',
 'try',
 'cut',
 'vital',
 'program',
 'determine',
 'protect',
 'expand',
 'program',
 'american',
 'worker',
 'pay',
 'every',
 'paycheck']

In [214]:
df3

Unnamed: 0,user_name,class,id,text,author_id,created_at,text2,RT_user,RT_user#,word_count,text3,class_label,text4
0,BennieGThompson,Politics,1620584010991939584,"Today marks the 83rd anniversary of the first ever #SocialSecurity check, and Republic...",82453460,2023-02-01 00:45:11+00:00,today marks 83rd anniversary first ever socialsecurity check republicans celebrating t...,,,180,"[today, marks, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cel...",2,"[today, mark, 83rd, anniversary, first, ever, socialsecurity, check, republicans, cele..."
1,BennieGThompson,Politics,1620116251749269511,RT @VP: President Biden and I are just getting started. https://t.co/gLmNbpKGAN,82453460,2023-01-30 17:46:29+00:00,president biden getting started #vp#,vp,#vp#,36,"[president, biden, getting, started, #vp#]",2,"[president, biden, get, start, #vp#]"
2,BennieGThompson,Politics,1620116182618759168,"RT @RepJeffries: We will never negotiate away the health, safety or economic well-bein...",82453460,2023-01-30 17:46:12+00:00,never negotiate away health safety economic wellbeing american people #repjeffries#,repjeffries,#repjeffries#,83,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, #repje...",2,"[never, negotiate, away, health, safety, economic, wellbeing, american, people, #repje..."
3,BennieGThompson,Politics,1619330126361300993,The footage of Tyre Nichols killing is painful. I send my condolences to his family an...,82453460,2023-01-28 13:42:42+00:00,footage tyre nichols killing painful send condolences family friends justice must serv...,,,101,"[footage, tyre, nichols, killing, painful, send, condolences, family, friends, justice...",2,"[footage, tyre, nichols, kill, painful, send, condolence, family, friends, justice, mu..."
4,BennieGThompson,Politics,1619327606159179777,RT @CBCInstitute: Happy Birthday to our Chairman Congressman @BennieGThompson! In your...,82453460,2023-01-28 13:32:41+00:00,happy birthday chairman congressman benniegthompson 30th yr congress celebrate incredi...,cbcinstitute,#cbcinstitute#,108,"[happy, birthday, chairman, congressman, benniegthompson, 30th, yr, congress, celebrat...",2,"[happy, birthday, chairman, congressman, benniegthompson, 30th, yr, congress, celebrat..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
112722,RepLCD,Politics,1611786100825006080,It was great to catch up with my friend @RepFeenstra last night -- we're ready to get ...,1583530102297600000,2023-01-07 18:05:26+00:00,great catch friend repfeenstra last night were ready get work amp deliver promises mad...,,,114,"[great, catch, friend, repfeenstra, last, night, were, ready, get, work, amp, deliver,...",2,"[great, catch, friend, repfeenstra, last, night, be, ready, get, work, amp, deliver, p..."
112723,RepLCD,Politics,1611615029660639233,Thank you #OR05 for placing your trust in me to represent you in the halls of Congress...,1583530102297600000,2023-01-07 06:45:40+00:00,thank or05 placing trust represent halls congress solemn promise oregonians carry cons...,,,155,"[thank, or05, placing, trust, represent, halls, congress, solemn, promise, oregonians,...",2,"[thank, or05, place, trust, represent, hall, congress, solemn, promise, oregonian, car..."
112724,RepLCD,Politics,1610791524807081986,A small minority is preventing the House from doing the work we were sent here to do. ...,1583530102297600000,2023-01-05 00:13:21+00:00,small minority preventing house work sent do must get economy back track work get cost...,,,157,"[small, minority, preventing, house, work, sent, do, must, get, economy, back, track, ...",2,"[small, minority, prevent, house, work, send, do, must, get, economy, back, track, wor..."
112725,RepLCD,Politics,1610408428052295681,"As I take on the responsibility of serving #OR05, I'm very grateful to have my family ...",1583530102297600000,2023-01-03 22:51:03+00:00,take responsibility serving or05 im grateful family side,,,57,"[take, responsibility, serving, or05, im, grateful, family, side]",2,"[take, responsibility, serve, or05, im, grateful, family, side]"


In [215]:
#df3.to_csv('all_lemmatized.csv')

In [216]:
#df3 = pd.read_csv("all_lemmatized.csv")

In [217]:
df4 = df3.groupby(['user_name', 'class_label', 'class']).agg({'text4': 'sum'}).reset_index()
df4

Unnamed: 0,user_name,class_label,class,text4
0,20thcentury,5,TV / movies,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy..."
1,9to5mac,3,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf..."
2,ABCNetwork,5,TV / movies,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,..."
3,AOC,2,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ..."
4,Acyn,2,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo..."
...,...,...,...,...
581,travelchannel,6,Travel,"[late, episode, kindredspirits, u, like, miss, it, stream, discoveryplus, amybruni, ad..."
582,travelocity,6,Travel,"[sometimes, hard, part, travel, start, pack, process, weve, do, hard, part, ya, tag, f..."
583,virginiafoxx,2,Politics,"[regular, order, restore, people, house, student, reward, hard, work, education, burea..."
584,wbpictures,5,TV / movies,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu..."


In [218]:
df4['class'].value_counts()

Politics                487
Music                    29
Sports                   24
TV / movies              21
Business and finance     10
Science / Technology      9
Travel                    6
Name: class, dtype: int64

In [219]:
df4['count_text_4'] = df4['text4'].apply(len)
df4

Unnamed: 0,user_name,class_label,class,text4,count_text_4
0,20thcentury,5,TV / movies,"[titanic, sail, back, theater, valentine, day, weekend, 25th, anniversary, #theacademy...",6348
1,9to5mac,3,Science / Technology,"[9to5toys, last, call, eve, room, homekit, air, quality, monitor, mophie, snap, magsaf...",8886
2,ABCNetwork,5,TV / movies,"[even, betty, think, will, slip, miss, allnew, episode, willtrent, tonight, 109c, abc,...",5739
3,AOC,2,Politics,"[excite, humble, share, even, select, serve, repraskins, house, oversight, committee, ...",7371
4,Acyn,2,Politics,"[chad, comer, appear, coown, property, james, comer, receive, small, amount, covid, mo...",5426
...,...,...,...,...,...
581,travelchannel,6,Travel,"[late, episode, kindredspirits, u, like, miss, it, stream, discoveryplus, amybruni, ad...",7620
582,travelocity,6,Travel,"[sometimes, hard, part, travel, start, pack, process, weve, do, hard, part, ya, tag, f...",10614
583,virginiafoxx,2,Politics,"[regular, order, restore, people, house, student, reward, hard, work, education, burea...",746
584,wbpictures,5,TV / movies,"[plan, up, up, away, dcstudios, dcu, dccomics, #jamesgunn#, time, grow, up, shazam, fu...",6226


In [220]:
df5 = df4.groupby(['class_label', 'class']).agg({'count_text_4': 'sum'}).reset_index()
df5

Unnamed: 0,class_label,class,count_text_4
0,0,Business and finance,110475
1,1,Music,108593
2,2,Politics,908577
3,3,Science / Technology,109941
4,4,Sports,107750
5,5,TV / movies,124402
6,6,Travel,64818


In [343]:
X = df4['text4']
y = df4['class_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.35, stratify=y)

In [344]:
X_train

92     [youtube, kid, groom, child, lgbtq, propaganda, make, account, find, disgust, #aldobut...
318    [thank, national, district, attorney, association, ndaajustice, allow, visit, today, k...
20     [philip, alston, tie, second, left, ramblersmbb, #cbssportscbb#, moment, andy, reid, f...
371    [introduce, first, bill, think, differently, database, act, ny19, ithaca, get, 600k, i...
376    [time, go, work, #reppatfallon#, wed, 8am, live, dc, join, eugenescott, alexi, first, ...
                                                 ...                                            
385    [great, meet, secvetaffairs, discuss, work, together, ensure, veteran, inland, empire,...
227    [vast, majority, american, return, life, normal, yet, admin, lag, behind, repeal, covi...
321    [janschakowsky, replahood, scclemons, replahood, im, glad, speaker, mccarthy, put, sel...
488    [king, quest, inch, closer, cane, come, back, win, ot, another, milestone, king, lebro...
290    [mahalo, georgetakei, s

In [345]:
# Import the relevant vectorizer class
from sklearn.feature_extraction.text import TfidfVectorizer

# The text is already tokenized, so pass each token list through unchanged
def dummy_fun(doc):
    return doc

# Instantiate a vectorizer limited to the 750 most frequent features
# (unigrams to trigrams that appear in at least 2 documents)
tfidf = TfidfVectorizer(analyzer='word', tokenizer=dummy_fun, 
                        preprocessor=dummy_fun, token_pattern=None, 
                        ngram_range=(1,3), min_df=2, max_features=750)

# Fit the vectorizer on X_train and transform it
X_train_vectorized = tfidf.fit_transform(X_train)

# Visually inspect the vectorized data
pd.DataFrame.sparse.from_spmatrix(X_train_vectorized, columns=tfidf.get_feature_names_out())

Unnamed: 0,118th,118th congress,able,abortion,access,account,accountable,across,across country,act,...,write,year,year ago,yes,yesterday,yet,york,you,young,youre
0,0.000000,0.000000,0.036783,0.038219,0.000000,0.023027,0.000000,0.011274,0.005856,0.017299,...,0.013109,0.098063,0.014217,0.069034,0.000000,0.060014,0.007498,0.060516,0.026950,0.046241
1,0.049947,0.051239,0.000000,0.000000,0.000000,0.000000,0.000000,0.035495,0.027656,0.081699,...,0.000000,0.134456,0.000000,0.000000,0.023195,0.047238,0.000000,0.040828,0.025455,0.000000
2,0.000000,0.000000,0.010608,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.006616,0.022350,0.000000,0.006968,0.000000,0.005048,0.000000,0.008726,0.043523,0.006668
3,0.021176,0.021724,0.000000,0.000000,0.036651,0.030739,0.021724,0.060197,0.000000,0.013855,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.420369,0.017310,0.000000,0.000000
4,0.186386,0.191205,0.000000,0.000000,0.000000,0.000000,0.031867,0.000000,0.000000,0.040650,...,0.000000,0.074332,0.000000,0.000000,0.057703,0.000000,0.000000,0.050786,0.000000,0.038806
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,0.000000,0.000000,0.021818,0.090676,0.000000,0.000000,0.022523,0.062412,0.072940,0.086191,...,0.000000,0.105072,0.019677,0.000000,0.000000,0.041529,0.000000,0.017947,0.000000,0.000000
376,0.013972,0.014334,0.000000,0.000000,0.012091,0.020282,0.071668,0.079437,0.077364,0.054851,...,0.000000,0.050151,0.000000,0.000000,0.012977,0.026429,0.000000,0.011421,0.028483,0.000000
377,0.034442,0.035333,0.000000,0.017781,0.014903,0.000000,0.017666,0.012238,0.019071,0.067606,...,0.021346,0.103019,0.000000,0.000000,0.000000,0.000000,0.000000,0.014077,0.017553,0.000000
378,0.000000,0.000000,0.018218,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.054836,0.010954,0.007978,0.005676,0.011559,0.008665,0.000000,0.012458,0.000000
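
With `ngram_range=(1,3)`, each account's token list contributes unigrams, bigrams, and trigrams, which is where features such as `118th congress` come from. The helper below (`word_ngrams` is a hypothetical name, not part of scikit-learn) sketches that expansion for a single token list. It does not compute TF-IDF weights or apply `min_df` and `max_features`.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all space-joined n-grams of length min_n..max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams(["118th", "congress", "begin"])
print(grams)
```

A three-token list yields six candidate features, so the feature space grows quickly with longer documents. This is one reason the vectorizer above caps the vocabulary with `max_features`.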


First, try a Naive Bayes classifier

In [346]:
# Import relevant class and function
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score

# Instantiate a ComplementNB classifier
baseline_model = ComplementNB()

# Evaluate the classifier on X_train_vectorized and y_train
baseline_cv = cross_val_score(baseline_model, X_train_vectorized, y_train)
baseline_cv.mean()



0.9342105263157894

In [347]:
# Transform X_test with the vectorizer already fitted on X_train
X_test_vectorized = tfidf.transform(X_test)

# Visually inspect the vectorized data
# pd.DataFrame.sparse.from_spmatrix(X_test_vectorized, columns=tfidf.get_feature_names_out())

In [348]:
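# Cross-validate a fresh copy of the classifier on X_test_vectorized and y_test
# (this refits on folds of the test split; it does not score a model trained on X_train)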
baseline_cv = cross_val_score(baseline_model, X_test_vectorized, y_test)
baseline_cv.mean()



0.951335656213705
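
To put the accuracy scores in context, it helps to compare them with a majority-class baseline. The sketch below takes the account-level class counts reported earlier in this notebook and computes the accuracy of always predicting the largest class. It is only a reference point, and it assumes the train and test splits keep roughly these proportions (the split is stratified).

```python
# Account-level class counts, copied from the df4 value_counts() output earlier in the notebook
account_counts = {
    "Politics": 487, "Music": 29, "Sports": 24, "TV / movies": 21,
    "Business and finance": 10, "Science / Technology": 9, "Travel": 6,
}

total = sum(account_counts.values())
majority_class = max(account_counts, key=account_counts.get)
baseline_accuracy = account_counts[majority_class] / total

print(f"{majority_class}: {baseline_accuracy:.3f}")
```

A cross-validated accuracy of about 0.93 is therefore above the roughly 0.83 that a constant "Politics" prediction would reach. Per-class metrics such as precision and recall would show how well the smaller classes are actually handled.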