<a href="https://colab.research.google.com/github/linesn/reddit_analysis/blob/main/Sentiment_Analysis_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis Exercise 
Nicholas Lines  
EN.605.633.81.SP21 Social Media Analytics  

## Introduction
This notebook is my response to the prompt to train a sentiment analysis classifier using [NLTK](https://www.nltk.org/) and [the Sentiment140 corpus](http://help.sentiment140.com/for-students), which was introduced in [1]. As instructed, we'll follow [the tutorial](https://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/) written by Laurent Luce. The result (both of the tutorial and what we'll make here) is a document (tweet) level binary positive/negative classifier using bag-of-words features. The resulting classifier is purely for learning and demonstration purposes, and is not fit for use in real applications -- for real English language sentiment analysis projects I recommend [VADER](https://github.com/cjhutto/vaderSentiment) for lexical rules-based decisions at the sentence or document level, or something like [the Stanford NLP approach](https://nlp.stanford.edu/sentiment/) for supervised modeling. 

## Setting up the environment

In [1]:
%pylab inline
import pandas as pd
import os

Populating the interactive namespace from numpy and matplotlib


In [2]:
try:
  import langdetect
except:
  ! pip install langdetect
  import langdetect

Collecting langdetect
[?25l  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)
[K     |▍                               | 10kB 14.8MB/s eta 0:00:01[K     |▊                               | 20kB 20.2MB/s eta 0:00:01[K     |█                               | 30kB 9.6MB/s eta 0:00:01[K     |█▍                              | 40kB 7.7MB/s eta 0:00:01[K     |█▊                              | 51kB 7.0MB/s eta 0:00:01[K     |██                              | 61kB 7.5MB/s eta 0:00:01[K     |██▍                             | 71kB 7.9MB/s eta 0:00:01[K     |██▊                             | 81kB 8.2MB/s eta 0:00:01[K     |███                             | 92kB 8.5MB/s eta 0:00:01[K     |███▍                            | 102kB 7.3MB/s eta 0:00:01[K     |███▊                            | 112kB 7.3MB/s eta 0:00:01[K     |████                            | 122kB 7.3MB/s eta 0:00:

In [3]:
from tqdm.notebook import tqdm

In [4]:
import pickle

In [5]:
import nltk

In [7]:
if 'COLAB_GPU' in os.environ: # a hacky way of determining if you are in colab.
  print("Notebook is running in colab")
  from google.colab import drive
  drive.mount("/content/drive")
  DATA_DIR = "drive/MyDrive/Data/raw/"
  
else:
  # Get the system information from the OS
  PLATFORM_SYSTEM = platform.system()

  # Darwin is macOS
  if PLATFORM_SYSTEM == "Darwin":
      EXECUTABLE_PATH = Path("../dependencies/chromedriver")
  elif PLATFORM_SYSTEM == "Windows":
      EXECUTABLE_PATH = Path("../dependencies/chromedriver.exe")
  else:
      logging.critical("Chromedriver not found or Chromedriver is outdated...")
      exit()
  DATA_DIR = "../Data/"

Notebook is running in colab
Mounted at /content/drive


## Getting and preparing the training data
Note that the Sentiment140 data was gathered by querying Twitter for tweets including a given word (e.g. product name) AND emoticons that were used to declare a tweet Positive or Negative. The raw data is kept unchanged except the emoticons are removed. 

In [8]:
if not os.path.exists(DATA_DIR + "/training.1600000.processed.noemoticon.csv"):
  ! wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
  ! unzip trainingandtestdata.zip -d $DATA_DIR
  ! ls $DATA_DIR -lrt

The tweets in the corpus are labeled as follows  

| number | meaning  |  
| ------ | -------- |  
| 0      | negative |
| 2      | neutral  |
| 4      | positive |

In practice, though, the data seems to only include the negative and positive tweets. I also notice that there does not appear to be any language filtration in place, so we would want to add that in a real-life application with mixed data.


In [9]:
header = ["polarity", "tweet_id", "date", "query", "user", "text"]
df = pd.read_csv(DATA_DIR+"training.1600000.processed.noemoticon.csv", parse_dates=True, names=header, encoding="latin-1")
df.head()

Unnamed: 0,polarity,tweet_id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [10]:
df.shape

(1600000, 6)

In [11]:
df.polarity.value_counts()

4    800000
0    800000
Name: polarity, dtype: int64

In [12]:
lengths = array([len(i) for i in df["text"]])
pd.value_counts(lengths)

138    29850
137    22142
136    18793
48     16652
46     16616
       ...  
243        1
244        1
248        1
252        1
374        1
Length: 257, dtype: int64

In [13]:
df.isnull().any()

polarity    False
tweet_id    False
date        False
query       False
user        False
text        False
dtype: bool

The text encoding (`latin-1`) is inconvenient, so we'll change that.

In [14]:
df.text = df.text.apply (lambda row: row.encode("utf-8", "ignore").decode('utf-8','ignore'))

In [15]:
# df["lang"] = df.text.apply (lambda row: langdetect.detect(row))

In [16]:
negatives = df[df["polarity"]==0].text.to_numpy()
positives = df[df["polarity"]==4].text.to_numpy()

In [17]:
percent_test = .80
plim = int(len(positives) * percent_test)
nlim = int(len(negatives) * percent_test)
positives_train = positives[:plim]
positives_test = positives[plim:]
negatives_train = negatives[:nlim]
negatives_test = negatives[nlim:]

In [18]:
tweets = []
for words in positives_train:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, 'positive'))
for words in negatives_train:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, 'negative'))
test_tweets = []
for words in positives_test:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, 'positive'))
for words in negatives_test:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, 'negative'))

In [19]:
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
      all_words.extend(words)
    return all_words


def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features


def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

In [20]:
word_features = get_word_features(get_words_in_tweets(tweets))

In [21]:
training_set = nltk.classify.apply_features(extract_features, tweets)
testing_set = nltk.classify.apply_features(extract_features, test_tweets)

In [45]:
shortset = training_set[:10] + training_set[int(len(training_set)/2):int(len(training_set)/2+10)]
longset = training_set[:20] + training_set[int(len(training_set)/2):int(len(training_set)/2+20)]

## Building the classifier and Training

In [43]:
%%time
classifier = nltk.NaiveBayesClassifier.train(shortset)

CPU times: user 1min 22s, sys: 1.61 s, total: 1min 24s
Wall time: 1min 24s


In [46]:
%%time
classifier = nltk.NaiveBayesClassifier.train(longset)

CPU times: user 2min 27s, sys: 2.49 s, total: 2min 29s
Wall time: 2min 29s


In [50]:
((2 +27/60)/40)*len(training_set)/60/24/7

9.722222222222223

In [None]:
with open(DATA_DIR+"/classifier.pkl", 'wb') as outfile:
  pickle.dump(classifier, outfile)

In [42]:
classifier.classify(extract_features("this is a dumb sentence".split()))

'negative'

## Testing

In [31]:
classifier.show_most_informative_features()

Most Informative Features
          contains(your) = True           positi : negati =      2.3 : 1.0
          contains(with) = True           negati : positi =      2.3 : 1.0
           contains(you) = True           positi : negati =      1.8 : 1.0
       contains(friends) = True           negati : positi =      1.7 : 1.0
           contains(for) = False          negati : positi =      1.6 : 1.0
           contains(the) = False          negati : positi =      1.4 : 1.0
           contains(you) = False          negati : positi =      1.3 : 1.0
          contains(with) = False          positi : negati =      1.3 : 1.0
          contains(your) = False          negati : positi =      1.3 : 1.0
           contains(who) = False          positi : negati =      1.2 : 1.0


In [41]:
nltk.classify.accuracy(classifier, shortset)
#print(label_probdist.prob('positive'))

1.0

## Results

## References
[1] Go, Alec, Richa Bhayani, and Lei Huang. "Twitter sentiment classification using distant supervision." CS224N project report, Stanford 1.12 (2009): 2009.
