# Sentiment Analysis - Logistic Regression
In this small proof of concept (POC), I will show how to apply logistic regression to determine whether a given tweet carries a positive or a negative sentiment.
For this, we will use a dataset included in the NLTK package that already contains 10k labeled tweets: 5k marked as positive and 5k as negative. This is convenient, since the classes are perfectly balanced and the model sees as many positive examples as negative ones during training.

## Reading data and understanding it
Our first step is always to get familiar with the data: we need to understand its format and how to extract the information we need to train our model.

In [15]:
# Import some libraries we always use
import pandas as pd
import numpy as np

from nltk.corpus import twitter_samples


# To know the ids of the json files, we can run the following command
#print(twitter_samples.fileids())

# Reading our files
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_positive_tweets = twitter_samples.strings('positive_tweets.json')

# Count how many tweets we have of each type
print(f"Amount of positive tweets: {len(all_positive_tweets)}")
print(f"Amount of negative tweets: {len(all_negative_tweets)}")
print()

# Let's display some samples
print("Some positive tweets:")
print(all_positive_tweets[1:10])
print()
print("Some negative tweets:")
print(all_negative_tweets[1:10])


Amount of positive tweets: 5000
Amount of negative tweets: 5000

Some positive tweets:
['@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days', '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM', "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI", '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.', 'Jgh , but we have to go to Bayan :D bye', 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing app Katamari.\n\nWell… as the name im

## Pre-processing and cleaning our data
This is a classic step of the NLP pipeline. Here we clean the text, remove punctuation, tokenize it into separate words, remove stopwords (the most common words in English) and reduce each remaining word to its root form (stemming).

For tweets in particular, we can see that some messages contain URLs, which we should also get rid of (since URLs don't add any value to the sentiment of a message). The same goes for retweet marks and user mentions (in the previous messages we can see mentions such as "@ketchBurning", which don't carry any sentiment either).
Let's get rid of all this information and clean our text.
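
Before the full function, here is a minimal sketch of just the noise-removal step, using only the standard library. The helper name `strip_tweet_noise`, the sample tweet, and the `@\w+` mention pattern are assumptions made for illustration; the notebook's own code relies on NLTK's `TweetTokenizer` to strip handles instead.

```python
import re

def strip_tweet_noise(text):
    """Remove a leading retweet mark, URLs, @mentions and '#' signs."""
    text = re.sub(r'^RT\s+', '', text)        # old-style retweet prefix
    text = re.sub(r'https?://\S+', '', text)  # hyperlinks (URL only)
    text = re.sub(r'@\w+', '', text)          # user mentions (assumed pattern)
    text = text.replace('#', '')              # keep the hashtag word, drop '#'
    return ' '.join(text.split())             # collapse leftover whitespace

# Hypothetical tweet, not taken from the dataset
raw = "RT @some_user Loving the new update :) #HappyFriday https://example.com/abc thanks!"
print(strip_tweet_noise(raw))
# Loving the new update :) HappyFriday thanks!
```

Note that the emoticon and the hashtag word are kept on purpose: both can carry sentiment, while the URL and the handle cannot.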

In [23]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

ps = PorterStemmer()
en_stopwords = stopwords.words('english')

def process_tweet(tweet):
    # Remove retweet marks, hyperlinks, and the hash sign of hashtags
    # Remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)

    # Remove hyperlinks (only the URL itself, not the text that follows it)
    tweet = re.sub(r'https?://\S+', '', tweet)

    # Remove only the hash sign (#) and keep the hashtag word
    tweet = re.sub(r'#', '', tweet)
    
    # Tokenize and lowercase words (so "Hello" and "hello" become the same token),
    # strip @handles, and shorten runs of repeated characters
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokenized = tokenizer.tokenize(tweet)
    
    # Remove punctuation and stopwords, then stem each remaining token
    tweet_tokenized_cleaned = []
    for token in tweet_tokenized:
        if token not in string.punctuation and token not in en_stopwords:
            tweet_tokenized_cleaned.append(ps.stem(token))
    
    return tweet_tokenized_cleaned
    

# Testing it out
sample_tweet = all_positive_tweets[0]
processed_tweet = process_tweet(sample_tweet)
print(f"Sample Tweet: {sample_tweet}")
print(f"Processed Tweet: {processed_tweet}")

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Processed Tweet: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
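
One detail worth noticing in the output above is that the emoticon ":)" survives the punctuation filter. The check `token not in string.punctuation` is a substring test against one long string of punctuation characters, so only tokens that happen to appear as a contiguous piece of that string are dropped. The sketch below illustrates this with hand-written tokens (they are assumptions, not actual `TweetTokenizer` output) and compares it with a stricter, hypothetical character-by-character check.

```python
import string

# Hand-written tokens for illustration only
tokens = ["!", ":)", "...", "()", "week"]

# Same test as in process_tweet: is the token a substring of string.punctuation?
kept_substring = [t for t in tokens if t not in string.punctuation]

# Stricter alternative: drop tokens made up entirely of punctuation characters
kept_strict = [t for t in tokens
               if not all(ch in string.punctuation for ch in t)]

print(kept_substring)  # [':)', '...', 'week']
print(kept_strict)     # ['week']
```

For sentiment analysis the substring behaviour is arguably helpful, since emoticons such as ":)" are strong sentiment signals, but it also lets tokens like "..." through.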


## What can we use to train our model with?
Well, we can now process each one of our tweets. However, we still need to decide what information the model will use to determine whether a given tweet has a positive or a negative sentiment.

One thing is clear: we want a vector representation of each tweet, and we want that representation to be as useful as possible. Building it is called feature extraction.
A simple starting point is to count how many times each word appears in positive tweets and how many times it appears in negative ones. We will write a function that returns a dictionary whose keys are (word, label) pairs and whose values are the total number of times that word appeared in tweets with that label.

As an example, we might have:
("week", 1) -> 15
("week", 0) -> 5
This means that the token "week" appeared 15 times in positive tweets (that is why the key contains the label 1) and 5 times in tweets labeled as negative. We will later use this dictionary to build our feature vector (that is, the vector we will use as input to our model).
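
To make the counting concrete, here is a small sketch on hand-tokenized toy tweets using `collections.Counter`. The toy data and the `tweet_features` helper are illustrative assumptions. In particular, the `[bias, positive_sum, negative_sum]` layout is only one common way to turn such counts into a feature vector; the text above does not say which layout this notebook uses.

```python
from collections import Counter

# Toy, already-tokenized tweets with their labels (1 = positive, 0 = negative)
toy_tweets = [
    (["great", "week", ":)"], 1),
    (["week", "love"], 1),
    (["bad", "week", ":("], 0),
]

# Count how often each (token, label) pair occurs
freqs = Counter()
for tokens, label in toy_tweets:
    for token in tokens:
        freqs[(token, label)] += 1

def tweet_features(tokens, freqs):
    """Assumed layout: [bias, sum of positive counts, sum of negative counts]."""
    pos = sum(freqs[(t, 1)] for t in tokens)  # Counter returns 0 for unseen pairs
    neg = sum(freqs[(t, 0)] for t in tokens)
    return [1, pos, neg]

print(freqs[("week", 1)], freqs[("week", 0)])    # 2 1
print(tweet_features(["great", "week"], freqs))  # [1, 3, 1]
```

A tweet whose tokens appear mostly in positive examples ends up with a large second component, which is the kind of signal a logistic regression model can learn to weight.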

In [24]:
# As input, this receives all the tweets and their labels (1 or 0) that we use to train our model
def frequency_dictionary(tweets, labels):
    # Zip the tweets and the labels so we get a (tweet, label) tuple for each tweet
    tweets_labels = zip(tweets, labels)
    freq_dict = {}
    
    for (tweet,label) in tweets_labels:
        tweet_tokens = process_tweet(tweet)
        for token in tweet_tokens:
            pair = (token,label)
            if pair in freq_dict:
                freq_dict[pair] += 1
            else:
                freq_dict[pair] = 1

    # Return the (token, label) -> count dictionary so callers can use it
    return freq_dict

# Let's try this with 50 messages from each 
