In [17]:
! pip install nltk
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet
from nltk import pos_tag, WordNetLemmatizer


df = pd.read_csv("SenateTweets.csv", parse_dates=["TweetDate"])




In [18]:
pd.options.mode.chained_assignment = None

I have added the code above as a way of dealing with writing over data frames as we clean our data. Anytime we want to process a data frame with some cleaning algorithm, we may not care about the old data and so will want to simply write-over the old data frame. Although it is not always a best practice to write over old data with new data, it is often more efficient for memory and so I simply suggest using your own discretion. Using the above code will stop Pandas from printing a warning to this effect.

In [19]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Hashtags,TweetID,TweetDate,Retweet,TweetText,UserName,ScreenName,UserID,...,Rep_candidate,Ind_candidate,Winner,Loser,Female_candidate,Male_candidate,Gender_Mentioned,Trump,Biden,prez_mentioned
0,1,5816292,[],1308194337264513024,2020-09-22 00:00:09+00:00,True,Cory Gardner just said he'll vote to fill Ruth...,((( PJ )),pawsupbuttdown,352780384,...,0,0,1,0,0,1,M,0,0,N
1,4,5816295,[],1308194337684107265,2020-09-22 00:00:09+00:00,True,Cory Gardner just said he'll vote to fill Ruth...,brrrgrrr,BurrerMelissa,1254135911396388867,...,0,0,1,0,0,1,M,0,0,N
2,7,5816298,[],1308194337780412416,2020-09-22 00:00:09+00:00,True,Cory Gardner just said he'll vote to fill Ruth...,🌊Blue Tsunami,CeCeChattter,4786062176,...,0,0,1,0,0,1,M,0,0,N
3,10,5816301,[],1308194338061586433,2020-09-22 00:00:09+00:00,True,"@SenCoryGardner FACT CHECK: February 18, 2016 ...",American Nina🌻,Thenina77,2387540972,...,1,0,0,1,0,1,M,0,0,N
4,17,5816308,[],1308194338900455424,2020-09-22 00:00:10+00:00,True,"In Trump they trust. cc: @CoryGardner, @sena...",jckthesword,jckthesword,14616404,...,1,0,1,0,0,1,M,0,0,N


In [20]:
text_col = 'TweetText'
#group_col = 'gender'
#df_text = df[[group_col, text_col]]

df_text = df[[text_col]]

df_text[text_col] = df_text[text_col].replace(to_replace=r'[ , | ? | $ | . | ! | - | : ]' , value = r'', regex = True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'[ ^a-zA-Z ] ', value = r' ', regex = True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'\s\s+' , value = r' ', regex = True)


In the next set of code we first identify the column that contains our grouper like gender. We then identify our column that contains our text data. Once we have identified our two most important columns we create a new data frame of just those columns called df_text.

Finally, I have included 3 different sets of code for doing some initial processing of the text data using Regex functions. The first function replaces funny symbols with nothing in order to remove funny symbols from analysis. You can add more symbols that may be unique to your data set by adding a | and then the symbol after. The second regex function replaces all non-letters with a space. The last regex pattern removes extra blank spaces and replaces them with a single space to ensure that each word only contains one space to the next word. These obviously have overlapping effects so use one, all, or modify to your specific needs.

In [21]:
df_text.head(10)

Unnamed: 0,TweetText
0,CoryGardnerjustsaidhe'llvotetofillRuthBaderGin...
1,CoryGardnerjustsaidhe'llvotetofillRuthBaderGin...
2,CoryGardnerjustsaidhe'llvotetofillRuthBaderGin...
3,@SenCoryGardnerFACTCHECKFebruary182016inYOURow...
4,InTrumptheytrustcc@CoryGardner@senatemajldr@Li...
5,@LindseyGrahamSChttps//tco/gJq3lSlvHc
6,OKAmericaCoryGardnerhasannouncedheisall-infora...
7,@RobDaly19@MNNASS1sun@Jimborobo@LindseyGrahamS...
8,@sebulia1@CatSkoor@commons96055467@LindseyGrah...
9,@LindseyGrahamSCIagreeButtheymadesureyoupaidfo...


Error in data type shown below. We must convert the text column to a string type. 

In [22]:
#df_text = df_text.reset_index
df_text.TweetText[5]

'@LindseyGrahamSChttps//tco/gJq3lSlvHc'

In [25]:
df['TweetText'] = df['TweetText'].astype(str)

In [26]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\student\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [27]:
wnl = WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words('english'))

In [28]:
operators = set(['not','n/a','na'])
stopwords = stop - operators

In [29]:
def remove_stopwords(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
           return wordnet.VERB
    if treebank_tag.startswith('N'):
        return wordnet.NOUN
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return 'n'

def lemmarati(tup_list):
    if not (np.all(pd.notnull(tup_list))):
        return tup_list
    outputlist = []
    for i, j in tup_list:
        pos = get_wordnet_pos(i)
        lemma = wnl.lemmatize(i)
        outputlist.append(lemma)
    return outputlist

In the next set of code, we are activating and setting up some functions that will allow us to do some more cleaning and normalizing of the text data. More specifically, the code sets up a function to remove stopwords, or words that are very common and as a result not all that meaningful (e.g. the). The remaining code also performs lemmatization. Lemmatization is a way of normalizing text so that words like Python, Pythons, and Pythonic all become just Python. Thus, lemmatization is like stemming but it takes the part of speech into account so that meet (v) and meeting (n) are kept separate.

Also, note that before defining our stopword list we remove some words that we want to keep in our topic analysis. Words like ‘not’ although often considered a stopword, can be very important when performing topic or sentiment analysis. Consider the difference between ‘happy’ and ‘not happy.’ The latter is the opposite of the former however if we used the nltk stopwords list we would remove ‘not’ from the list and run the risk of thinking most comments were ‘happy’ when in reality they were ‘not happy.’

In [30]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\student\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [31]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\student\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [32]:

df_text[text_col] = df_text[text_col].map(lambda x: nltk.word_tokenize(x.lower()) if (np.all(pd.notnull(x))) else x.lower())



In [33]:
df_text[text_col] = df_text[text_col].map(lambda x: pos_tag(x) if (np.all(pd.notnull(x))) else x)


In [34]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\student\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [35]:
df_text[text_col] = df_text[text_col].map(lemmarati)

In [36]:
df_text[text_col] = df_text[text_col].map(lambda x: remove_stopwords(x,stopwords) if (np.all(pd.notnull(x))) else x)

In [37]:
df_text[text_col] = df_text[text_col].map(lambda x: ' '.join(x) if (np.all(pd.notnull(x))) else x)

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features = 1000
n_topics= 10

tfidf_vec = TfidfVectorizer(max_df = .95, min_df = 2, max_features = n_features, ngram_range = (2,3))

groups = df_text[group_col].unique()
results = []

for i in groups: 
    df_grp = adf_text.loc[adf_text[group_col] == i]
    if len(df_grp[text_col]) > 3:
        tf = tfidf_vec.fit_transform(df_grp[text_col])
        feature_names = tfidf_vec.get_feature_names()
        try:
            nmf = NMF(n_components = n_topics, random_state=1,alpha=.1, l1_ratio=.5).fit(tf)
            df_topics = pd.DataFrame(nmf.components_)
            df_topics.columns = feature_names
            df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index,index=['top1','top2','top3','top4','top5']), axis=1).reset_index()
            df_top['Group'] = i
            results.append(df_top)
        except:
            results.append(i+' Did not produce topic results')

In [41]:
results = []
if len(df_text[text_col]) > 3:
    tf = tfidf_vec.fit_transform(df_text[text_col])
    feature_names = tfidf_vec.get_feature_names()
    try:
        nmf = NMF(n_components = n_topics, random_state=1,alpha=.1, l1_ratio=.5).fit(tf)
        df_topics = pd.DataFrame(nmf.components_)
        df_topics.columns = feature_names
        df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index,index=['top1','top2','top3','top4','top5']), axis=1).reset_index()
        results.append(df_top)
    except:
        results.append(i+' Did not produce topic results')



In the code above, we first get a list of the unique groups in our grouping column. We then create a container (in this case a list) to hold our resulting data frames from the NMF topic analysis.

In the for loop, we perform a separate NMF analysis for each unique group contained in the grouping column. We use the ‘if len(df_grp[text_col]) > 100’ logic to ensure we have enough rows of text for the analysis. We use the ‘try:’ statement to ensure that the analysis will still run in case one of the groups gives us an error. In the ‘try:’ code we perform the NMF, extract the components into a data frame, label the data frame with the feature names (the bi and trigrams), selecting only the top 5 bi and trigrams for each topic based on their numeric contribution to the topic, add a column to the data frame to keep track of which group the topics are for, and append the results into our results list.

Now we have a list of data frames, which are not useful as a list so one more step before we finish.

In [42]:
topic_results = pd.concat(results,axis=0)
topic_results.to_csv('C:/Users/student/Desktop/my_results.csv')

In [43]:
topic_results.head()

Unnamed: 0,index,top1,top2,top3,top4,top5
0,0,https tco,lindseygrahamsc https,joebiden amp,senatorloeffler senmikelee,justsentanother100to harrisonjaime
1,1,lindseygrahamschttps tco,tco flxakztyeq,agreed lindseygrahamschttps tco,agreed lindseygrahamschttps,lindseygrahamschttps tco flxakztyeq
2,2,harrisonjaimehttps tco,pleasedon tmakeitworseforgrahambygivingtohisop...,stevebullockmtnc calforncsc,stevebullockmtnc calforncsc harrisonjaimehttps,calforncsc harrisonjaimehttps
3,3,amymcgrathkyhttps tco,thefailedvotehimout amymcgrathkyhttps,thefailedvotehimout amymcgrathkyhttps tco,kentuckyyouarethe5thpooreststateintheunionnoma...,kentuckyyouarethe5thpooreststateintheunionnoma...
4,4,itwouldbemeanforpeopletosharethisandtag lindse...,ifyouareoneofthe75millionamericanswhovotedfor ...,it snotjustthatyou,it snotjustscotuschairman lindseygrahamschassc...,it snotjustscotuschairman


In [44]:
topic_results.top3[4]

'it snotjustthatyou'