# Question 1: Logistic Regression


## Import functions and data

In [20]:
# run this cell to import nltk
import numpy as np
import pandas as pd
import nltk
from os import getcwd
import re
import string

### Imported functions

Download the data needed for this assignment. Check out the [documentation for the twitter_samples dataset](http://www.nltk.org/howto/twitter.html).

* `twitter_samples` and `stopwords`: when running on a local computer, download them with:
```Python
nltk.download('twitter_samples')
nltk.download('stopwords')
```

#### Import some helper functions that we provided in the utils.py file:
* `clean_tweet()`: cleans, tokenizes, removes stopwords, and converts words to stems.
* `build_frequency()`: counts how often each word in the entire set of tweets was associated with a positive label '1' or a negative label '0', then builds the `frequency_words` dictionary, where each key is a (word, label) tuple and the value is the number of times that pair occurs in the corpus of tweets.
* The `frequency_words` dictionary is the frequency dictionary that's being built. 
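
To make the shape of this dictionary concrete, here is a minimal sketch that counts `(word, label)` pairs for a few toy tweets that are already tokenized. The helper name `count_pairs` and the toy data are assumptions made for illustration; they are not part of `utils.py`.

```python
from collections import Counter

def count_pairs(tokenized_tweets, labels):
    """Count (word, label) pairs; a hypothetical stand-in that skips the cleaning step."""
    counts = Counter()
    for tokens, label in zip(tokenized_tweets, labels):
        for word in tokens:
            counts[(word, label)] += 1
    return dict(counts)

# Toy, already-stemmed tweets with made-up labels (1.0 = positive, 0.0 = negative)
toy_tweets = [["happi", "day"], ["sad", "day"], ["happi"]]
toy_labels = [1.0, 0.0, 1.0]

toy_freqs = count_pairs(toy_tweets, toy_labels)
print(toy_freqs)
```

Note that the same word can appear under both labels, as "day" does here, and each pair gets its own count.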

In [21]:
nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
# this code allows us to prevent downloading data again while refreshing our workspace
filePath = f"{getcwd()}/../temp/"
nltk.data.path.append(filePath)

In [23]:
print(filePath)

/content/../temp/


### Data processing
* The `twitter_samples` corpus contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.

In [24]:

from nltk.corpus import twitter_samples 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer



In [25]:





def clean_tweet(tweet):
    
    # tweets_clean: a list of words containing the processed tweet
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market symbols like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in string.punctuation and  
                word not in stopwords_english): 
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean


def build_frequency(tweets, y_np):
 
    # Convert the NumPy label array to a plain list so labels can be indexed alongside the tweets.
    # The squeeze drops the extra dimension; without it each label would be a one-element list.
    yslist = np.squeeze(y_np).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for i in range(len(tweets)):
        tweet = tweets[i]
        y = yslist[i]
        for word in clean_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs

In [26]:
# select the set of positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

In [27]:
print(positive_tweets[7])
print(negative_tweets[10])

@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.
I have a really good m&amp;g idea but I'm never going to meet them :(((


### Feature Extraction

* Given a list of tweets, extract the features and store them in a matrix. You will extract two features per tweet, alongside a constant bias term.
    * The first feature is the sum of the positive-label counts of the words in a tweet.
    * The second feature is the sum of the negative-label counts of the words in a tweet.
    * For each word, look up its count in the `frequency_words` dictionary under the positive label '1'. (Check for the key `(word, 1.0)`.)
    * Do the same for the count under the negative label '0'. (Check for the key `(word, 0.0)`.)

In [28]:
def extract_features(tweet, freqs):
    
    # clean_tweet tokenizes, stems, and removes stopwords
    word_l = clean_tweet(tweet)
    
    # feature vector of shape (1, 3): [bias, positive count, negative count]
    x = np.zeros((1, 3)) 
    x[0,0] = 1  # bias term
    
    for word in word_l:
        
        # add the word's count under the positive label
        x[0,1] = x[0,1] + freqs.get((word, 1.0),0)
        # add the word's count under the negative label
        x[0,2] = x[0,2] + freqs.get((word, 0.0),0)
        
    return x

#### Instructions: Write `sigmoid`
Compute the sigmoid of `z`, defined as $h(z) = \frac{1}{1 + e^{-z}}$.

In [29]:
def sigmoid(z): 
    
    # z is an input which can be a scalar or an array and h is the sigmoid of z 
    # write the formula for sigmoid here and assign it to h
    h = 0
    return h
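
One possible way to fill in the stub is sketched below with NumPy. It is written under the hypothetical name `sigmoid_sketch` so that it does not replace the function above, and it is only an illustration of the logistic function, not necessarily the intended solution.

```python
import numpy as np

def sigmoid_sketch(z):
    """Logistic function; works element-wise on scalars and NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid_sketch(0))                            # 0.5
print(sigmoid_sketch(np.array([-2.0, 0.0, 2.0])))   # values strictly between 0 and 1
```

A quick sanity check is that the output at 0 is exactly 0.5 and that `sigmoid_sketch(-z) + sigmoid_sketch(z)` equals 1 for any `z`.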

#### Instructions: Write `predict_positivity_score`
Predict whether a tweet is positive or negative.

* Given a tweet, process it, then extract the features.
* Apply the model's learned weights to the features to get the linear score `z`.
* Apply the sigmoid to `z` to get the prediction (a value between 0 and 1).
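
The last two steps can be sketched on a hand-made feature row. Everything here is an assumption made for illustration: the feature values, the weights in `theta`, and the helper name `sigmoid_sketch`. In the assignment the features come from `extract_features()` and the weights from training.

```python
import numpy as np

def sigmoid_sketch(z):
    """Logistic function (hypothetical helper for this example)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 1 x 3 feature row: [bias, positive count, negative count]
x = np.array([[1.0, 3.0, 1.0]])
# Made-up 3 x 1 weight column
theta = np.array([[0.0], [0.5], [-0.5]])

z = x @ theta                # shape (1, 1): linear score, here 0 + 1.5 - 0.5 = 1.0
y_pred = sigmoid_sketch(z)   # shape (1, 1): value strictly between 0 and 1
print(y_pred.item())
```

Because the positive count outweighs the negative count under these weights, the score lands above 0.5, which would be read as a positive prediction.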


In [30]:
def predict_positivity_score(tweet, freqs, theta):
    
    
    # extract features from the tweet using the frequency dictionary; x is multiplied by the
    # coefficients theta, and the result is passed to the sigmoid
    x = extract_features(tweet,freqs)
    
    # make the prediction using x and theta
    # compute y_pred here; you may need to call the sigmoid function
    y_pred = 0
    
    return y_pred

Note that the `frequency_words` dictionary should be built from the training data and training labels. In the example below it is built from only a few data points.

The given function `clean_tweet()` tokenizes each tweet, applies stemming (reducing each word to a root/base form), and removes stop words (common words such as "the", "a", and "an").

In [31]:
# IMPLEMENT gradient descent here.
# alpha is the learning rate
# x is the data and y is the label
# theta is the initial parameter values
# num_iters is the number of iterations you want the algorithm to run
def gradientDescent(x, y, theta, alpha, num_iters):
    
    # list_of_loss_values holds the loss at each iteration;
    # list_of_training_accuracy holds the training accuracy at each iteration
    return J, theta,list_of_loss_values,list_of_training_accuracy
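
As a rough guide, a batch gradient descent loop for logistic regression with a cross-entropy loss might look like the sketch below. The names `gradient_descent_sketch` and `sigmoid_sketch`, the toy data, and the choice to record loss and accuracy before each update are all assumptions for illustration. The sketch also does no clipping inside the logarithms, which real data may need.

```python
import numpy as np

def sigmoid_sketch(z):
    """Logistic function (hypothetical helper for this example)."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_sketch(x, y, theta, alpha, num_iters):
    """Batch gradient descent for logistic regression.

    x: (m, n) features, y: (m, 1) labels in {0, 1}, theta: (n, 1) weights.
    Returns the last recorded loss, the final weights, and per-iteration
    loss and training-accuracy lists.
    """
    m = x.shape[0]
    J = None
    losses, accuracies = [], []
    for _ in range(num_iters):
        h = sigmoid_sketch(x @ theta)  # (m, 1) predicted probabilities
        # cross-entropy loss averaged over the m examples
        J = (-(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m).item()
        losses.append(J)
        # fraction of examples whose thresholded prediction matches the label
        accuracies.append(float(np.mean((h >= 0.5) == (y == 1))))
        # gradient step on the weights
        theta = theta - (alpha / m) * (x.T @ (h - y))
    return J, theta, losses, accuracies

# Made-up, linearly separable toy data: [bias, positive count, negative count]
toy_x = np.array([[1.0, 2.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [1.0, 0.0, 2.0],
                  [1.0, 1.0, 3.0]])
toy_y = np.array([[1.0], [1.0], [0.0], [0.0]])

J, theta_fit, losses, accs = gradient_descent_sketch(
    toy_x, toy_y, np.zeros((3, 1)), alpha=0.1, num_iters=200)
print(losses[0], losses[-1], accs[-1])
```

With all-zero initial weights every prediction starts at 0.5, so the first recorded loss is log(2); on this toy data the loss then falls as the weights separate the two classes.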

* Train test split: 25% will be in the test set, and 75% in the training set.
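
A minimal sketch of such a split is shown below on placeholder data. The helper name `split_train_test` is hypothetical, and it simply takes the first 75% of each class for training without shuffling, which is an assumption rather than a requirement of the assignment.

```python
import numpy as np

def split_train_test(pos, neg, train_fraction=0.75):
    """Split two lists of tweets into train/test sets with matching label columns."""
    n_pos = int(len(pos) * train_fraction)
    n_neg = int(len(neg) * train_fraction)
    train_x = pos[:n_pos] + neg[:n_neg]
    test_x = pos[n_pos:] + neg[n_neg:]
    # labels: 1 for positive tweets, 0 for negative tweets, as (m, 1) columns
    train_y = np.append(np.ones((n_pos, 1)), np.zeros((n_neg, 1)), axis=0)
    test_y = np.append(np.ones((len(pos) - n_pos, 1)),
                       np.zeros((len(neg) - n_neg, 1)), axis=0)
    return train_x, test_x, train_y, test_y

# Placeholder strings standing in for real tweets
toy_pos = [f"positive tweet {i}" for i in range(8)]
toy_neg = [f"negative tweet {i}" for i in range(8)]

train_x, test_x, train_y, test_y = split_train_test(toy_pos, toy_neg)
print(len(train_x), len(test_x), train_y.shape, test_y.shape)
```

Applied to the 5,000 positive and 5,000 negative tweets, the same slicing would give 3,750 training and 1,250 test tweets per class.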

# Example
Here we show how to call these methods on a few data points. You may need to make similar calls on the training data after you make the train/test split.

In [32]:
some_number_of_tweets = positive_tweets[0:10] + negative_tweets[0:10]
some_number_of_labels = np.append(np.ones((len(positive_tweets[0:10]), 1)), np.zeros((len(negative_tweets[0:10]), 1)), axis=0)

In [33]:
frequency_words = build_frequency(some_number_of_tweets, some_number_of_labels)

In [34]:

# Quick sanity checks; you can remove these later, but they may help when testing the code
print(extract_features(some_number_of_tweets[6], frequency_words))
# test 2:
# check for when the words are not in the frequency_words dictionary
print(extract_features('lalalalala blahblahblah bobobobobbob', frequency_words))



[[ 1. 19.  0.]]
[[1. 0. 0.]]


In [35]:
# NOTE: call gradient descent to get the coefficients, then pass those coefficients to the predict function
# something like: predict_positivity_score(tweet, frequency_words, coefficients)

## Testing your model
After training your model using the training set above, check how your model might perform on real, unseen data, by testing it against the test set.

#### Implement `test_logistic` 
* Given the test data and the weights of your trained model, calculate the accuracy of your logistic regression model. 
* Use your `predict_positivity_score()` function to make predictions on each tweet in the test set.
* If the prediction is >= 0.5, the model's predicted label is 1; otherwise it is 0.
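
The thresholding and accuracy steps can be sketched on their own, independent of the model. The helper name `accuracy_from_scores` and the scores below are made up for illustration; in `test_logistic` the scores would come from `predict_positivity_score()`.

```python
import numpy as np

def accuracy_from_scores(scores, labels, threshold=0.5):
    """Threshold scores into 0/1 labels and return the fraction that match."""
    scores = np.asarray(scores, dtype=float).reshape(-1)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    preds = (scores >= threshold).astype(float)  # a score of exactly 0.5 counts as 1
    return float(np.mean(preds == labels))

# Made-up scores and true labels: three of the four predictions match
print(accuracy_from_scores([0.9, 0.4, 0.5, 0.2], [1, 0, 1, 1]))  # 0.75
```

Flattening both inputs first avoids accidental broadcasting when the labels are stored as an (m, 1) column and the predictions as a flat list.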


In [36]:
# Testing your model on the test set
def test_logistic(test_x, test_y, freqs, theta):
    
    # use your trained model to make predictions, compare them with the
    # actual values to compute the accuracy, and return that accuracy
    
    accuracy = 0
    return accuracy

In [37]:
# Use your model to predict whether each of these tweets has a positive or negative sentiment. If possible, give
# a short, intuitive explanation of the scores you obtain

my_tweet = ['Let that sink in',
            'My psychiatrist told me I was crazy and I said I want a second opion. He said okay, you are ugly too ',
            'I’d rather have a drink with Mel Gibson in his hotel tonight than Bill Cosby.',
            'Building trust is the key to success in any relationship. Excuses, irregularity, chronically late, etc., are the ingredients to kill the TRUST.',
            'We are best friends. Always remember that if you fall i will pick you up. After I finish laughing'
           ]

