# Exploring Twitter Sentiment

This is my practice working through a set of problems from the [intro to NLP course by Shivam Bansal](https://courses.analyticsvidhya.com/courses/Intro-to-NLP).

The course provides an example table of ~15,000 tweets and their associated metadata. First, we want to find the most common words in the tweets.

In [101]:
import pandas as pd
import numpy as np
import re
import nltk

In [2]:
dF = pd.read_csv('data/tweets.csv', index_col=0, encoding='ISO-8859-1')

print(dF.head(5))

   X                                               text  favorited  \
1  1  RT @rssurjewala: Critical question: Was PayTM ...      False   
2  2  RT @Hemant_80: Did you vote on #Demonetization...      False   
3  3  RT @roshankar: Former FinSec, RBI Dy Governor,...      False   
4  4  RT @ANI_news: Gurugram (Haryana): Post office ...      False   
5  5  RT @satishacharya: Reddy Wedding! @mail_today ...      False   

   favoriteCount replyToSN              created  truncated  replyToSID  \
1              0       NaN  2016-11-23 18:40:30      False         NaN   
2              0       NaN  2016-11-23 18:40:29      False         NaN   
3              0       NaN  2016-11-23 18:40:03      False         NaN   
4              0       NaN  2016-11-23 18:39:59      False         NaN   
5              0       NaN  2016-11-23 18:39:39      False         NaN   

             id  replyToUID  \
1  8.014957e+17         NaN   
2  8.014957e+17         NaN   
3  8.014955e+17         NaN   
4  8.01495

In [3]:
def count_word_freqs(text, n_top = 20):
    """
    Count the frequency of each word and return the most frequent
    
    Parameters: 
    -----------
    text : pandas series object
        Pandas series object (column) of tweet texts
    n_top : int
        the number of most frequent entries to be returned
    
    Returns:
    -----------
    count_freqs : pandas Series
        A pandas series with the most frequent words and the 
        frequency in which they appear
    """
    # stack and split tweets into individual words
    stacked = np.hstack(text.str.split())
    
    count_freqs = pd.Series(stacked).value_counts()[:n_top]
    
    return count_freqs


In [4]:
count_word_freqs(dF.text)

RT                 11053
to                  7650
is                  5152
in                  4491
the                 4331
of                  4053
#Demonetization     3253
demonetization      3162
on                  2751
#demonetization     2474
PM                  2384
Modi                2379
India               2243
and                 2220
a                   2180
that                2168
out                 1729
for                 1672
so                  1599
had                 1598
dtype: int64

Some of these most frequent words don't help us gauge sentiment on Twitter. For instance, RT, to, is, in, the, and of are common but carry very little meaning, so we would like to ignore them. Additionally, some words, such as demonetization, appear in several forms (#Demonetization, demonetization, #demonetization) that are counted separately. These should be counted together to get the true number of occurrences. We will now normalize the text to get better results.

In [89]:
def count_word_freqs(text, n_top=20):
    
    stripped = []
    
    for tweet in text:
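        # replace the retweet marker RT with a space
        # (note: this also matches "RT" inside longer uppercase words)
        tweet = re.sub(r'RT', ' ', tweet)
        
        # replace punctuation and the # and @ symbols with spaces
        tweet = re.sub(r'[?!.;:,#@-]', ' ', tweet)
        
        # replace the HTML entity &amp with &
        tweet = re.sub(r'&amp', '&', tweet)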
        
        # convert to lowercase for consistency when counting
        tweet = tweet.lower()
        
        # split tweet into individual words and append to master list
        stripped.append(tweet.split())
    
    stacked = np.hstack(stripped)
    count_freqs = pd.Series(stacked).value_counts()[:n_top]
    
    return count_freqs

In [90]:
count_word_freqs(dF.text)

demonetization    14461
to                 7725
https              6514
//t                6142
is                 5370
the                5127
in                 4717
of                 4110
and                2907
on                 2820
modi               2761
india              2759
pm                 2732
a                  2392
that               2202
out                1861
so                 1768
for                1732
by                 1677
who                1624
dtype: int64
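The `https` and `//t` entries in the output above are most likely fragments of shortened links (such as `https://t.co/...`) that were split apart when `:` and `.` were replaced with spaces. One possible fix, sketched here under that assumption, is to remove URLs before the punctuation step. The names `URL_PATTERN` and `clean_tweet` are hypothetical and not part of the original notebook, and the sample tweets are made up for illustration. `collections.Counter` is used instead of pandas only so the snippet runs on its own.

```python
import re
from collections import Counter

# Hypothetical pattern: anything starting with http:// or https://
# up to the next whitespace character is treated as a URL.
URL_PATTERN = re.compile(r'https?://\S+')

def clean_tweet(tweet):
    """Return the lowercase tokens of one tweet with URLs removed."""
    # drop URLs first, before the punctuation step can split them apart
    tweet = URL_PATTERN.sub(' ', tweet)
    # replace the same special characters as in the cell above with spaces
    tweet = re.sub(r'[?!.;:,#@-]', ' ', tweet)
    return tweet.lower().split()

# Made-up sample tweets, not taken from the dataset
sample = [
    'Read this https://t.co/abc123 #Demonetization',
    'demonetization explained: https://t.co/xyz789',
]

counts = Counter(word for tweet in sample for word in clean_tweet(tweet))
print(counts.most_common(3))
```

With this ordering, the link fragments never reach the word counts, while the hashtag and plain forms of a word are still merged.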

We have fixed some of the issues here, but we still need to remove words that carry little meaning. We will try again, this time removing so-called stop words (common words that carry little meaning on their own) using the Python package NLTK.

In [109]:
# download the stop words
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/pdrew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [108]:
# print the first 20 stop words
print(nltk.corpus.stopwords.words('english')[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [None]:
def count_word_freqs(text, n_top=20):
    
    stripped = []
    
    for tweet in text:
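        # replace the retweet marker RT with a space
        # (note: this also matches "RT" inside longer uppercase words)
        tweet = re.sub(r'RT', ' ', tweet)
        
        # replace punctuation and the # and @ symbols with spaces
        tweet = re.sub(r'[?!.;:,#@-]', ' ', tweet)
        
        # replace the HTML entity &amp with &
        tweet = re.sub(r'&amp', '&', tweet)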
        
        # convert to lowercase for consistency when counting
        tweet = tweet.lower()
        
        # split tweet into individual words and append to master list
        stripped.append(tweet.split())
    
    stacked = np.hstack(stripped)
    count_freqs = pd.Series(stacked).value_counts()[:n_top]
    
    return count_freqs
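
The cell above does not yet filter stop words. One way the step could be finished is sketched below. It is an illustration under stated assumptions, not the notebook's final implementation: `count_word_freqs_no_stop` is a hypothetical name, `STOP_WORDS` is a tiny hand-picked stand-in for `nltk.corpus.stopwords.words('english')`, the sample tweets are made up, and `collections.Counter` replaces pandas so the snippet runs without the dataset or the NLTK download.

```python
import re
from collections import Counter

# Stand-in for nltk.corpus.stopwords.words('english'): a small hand-picked
# subset so the sketch runs without downloading the NLTK corpus.
STOP_WORDS = {'to', 'is', 'in', 'the', 'of', 'and', 'on', 'a', 'that', 'for'}

def count_word_freqs_no_stop(tweets, stop_words=STOP_WORDS, n_top=20):
    """Count the most frequent tokens that are not stop words."""
    counts = Counter()
    for tweet in tweets:
        # same cleaning steps as the cell above
        tweet = re.sub(r'RT', ' ', tweet)
        tweet = re.sub(r'[?!.;:,#@-]', ' ', tweet)
        tweet = re.sub(r'&amp', '&', tweet)
        words = tweet.lower().split()
        # keep only the words that are not in the stop-word set
        counts.update(word for word in words if word not in stop_words)
    # list of (word, count) pairs, most frequent first
    return counts.most_common(n_top)

# Made-up sample tweets, not taken from the dataset
sample = [
    'RT @user: Demonetization is the talk of India',
    '#demonetization and the PM in India',
]

top_words = count_word_freqs_no_stop(sample, n_top=3)
print(top_words)
```

To use the real list, `set(nltk.corpus.stopwords.words('english'))` could be passed as `stop_words`; converting to a set keeps each membership check fast. Because the function only iterates over its input, a pandas Series of tweet texts such as `dF.text` should work as the `tweets` argument.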

This is still a work in progress!