# ECE W382V MACHINE PROGRAMMING

# Homework 1 - due Sunday July 9 2022 at 11:59pm

For this homework you will upload to Canvas:
- a notebook renamed ``hw1_YourEID.ipynb``

__Before submitting__, please reset your kernel and rerun everything from the beginning (`Kernel` >> `Restart and Run All`) and ensure your code does not give ANY error. 

For programming tasks, make sure that your code can run using Python 3.5+. If you cannot complete a problem, include as much pseudocode as possible in the form of **Python comments** for partial credit.  **Please do NOT leave incomplete code in your homework; wrap it in comments instead.**

Collaboration: you are free to discuss the homework assignments with other students. However, all the code you write must be your own!

Information can be found about Markdown cells for Jupyter Notebooks here: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html


### Please list any person you discussed this assignment with:
- 




## Problem 1: the paradox of induction (15 points)

Consider a statement whose truth is unknown. If we see many examples that are compatible with it, we are tempted to view the statement as more probable. Such reasoning is often referred to as _inductive inference_ (in a philosophical, rather than mathematical sense). Consider now the statement that "all cows are white". An equivalent statement is that "everything that is not white is not a cow". We then observe several black panthers. Our observations are clearly compatible with the statement, but do they make the hypothesis "all cows are white" more likely?

To analyze such a situation, we consider a probabilistic model. Let us assume that there are two possible states of the world, which we model as complementary events:

<center>$A$: all cows are white,</center>

<center>$A^c$: 50% of all cows are white.</center>

Let $p$ be the prior probability $P(A)$ that all cows are white. We make an observation of a cow or a panther, with probability $q$ and $1-q$, respectively, independent of whether event $A$ occurs or not. Assume that $0<p<1, 0<q<1$, and that all panthers are black.




### (a) Given the event $B=$\{a black panther was observed\}, what is $P(A|B)$? Show your work (5pts)

Using the definition of conditional probability, we get:

$P(A|B)$ = $\frac{P(A\cap B)}{P(B)}$

Whether we observe a cow or a panther is independent of whether event $A$ occurs, and every panther is black, so $B$ is independent of $A$. Therefore:

$\frac{P(A\cap B)}{P(B)}$ = $\frac{P(A)\times P(B)}{P(B)}$ = $P(A)$ = $p$
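The same result can be cross-checked with Bayes' theorem, since $P(B|A) = P(B|A^c) = 1 - q$. The snippet below is a minimal sketch using exact rational arithmetic; the function name `posterior_given_panther` and the sample values of $p$ and $q$ are chosen here for illustration only.

```python
from fractions import Fraction

def posterior_given_panther(p, q):
    """P(A|B) via Bayes' theorem, where P(B|A) = P(B|A^c) = 1 - q."""
    likelihood_a = 1 - q        # panther observed when all cows are white
    likelihood_not_a = 1 - q    # panther observed when only half the cows are white
    evidence = p * likelihood_a + (1 - p) * likelihood_not_a  # P(B)
    return p * likelihood_a / evidence

# The posterior equals the prior for every tested (p, q) pair.
for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    for q in (Fraction(1, 4), Fraction(3, 4)):
        assert posterior_given_panther(p, q) == p
```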

### (b) Given the event $C=$\{a white cow was observed\}, what is $P(A|C)$? Show your work (5pts)

Applying Bayes' theorem we get:

$P(A|C)$ = $\frac{P(C | A) P(A)}{P(C)}$

$P(C | A)$ describes the probability of observing a white cow given that all cows are white. Since all cows are white, then this should be equivalent to the general probability of observing a cow, therefore:

- $P(C | A)$ = $q$

$P(A)$ is known from the problem statement:

- $P(A)$ = $p$

Finally, since $A$ and $A^c$ are mutually exclusive and together cover all possibilities, $P(C)$ can be computed as $P(C \cap A)$ + $P(C \cap A^c)$

Using the multiplication rule we get:

- $P(C \cap A)$ = $P(A) P(C | A)$ = $p q$

To find $P(C \cap A^c)$ we first need to find $P(C | A^c)$. In a world where 50% of cows are white, the probability of observing a white cow $P(C | A^c)$ is just $\frac{1}{2}$ that of observing a cow in general. Therefore:

- $P(C | A^c)$ = $\frac{1}{2}q$

Using the multiplication rule again, we get:

- $P(C \cap A^c)$ = $P(A^c) P(C | A^c)$ = $(1 - p) \frac{1}{2}q$

Putting all together:

- $P(A|C)$ = $\frac{P(C | A) P(A)}{P(C)}$ = $\frac{P(C | A) P(A)}{P(C \cap A) + P(C \cap A^c)}$ = $\frac{q p}{(p q) + (1 - p)(\frac{1}{2}q)}$

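The factor $q$ cancels, and multiplying numerator and denominator by 2 simplifies the fraction (this can be confirmed with Wolfram Alpha):

$P(A|C)$ = $\frac{2p}{p + 1}$ = $2 - \frac{2}{p + 1}$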
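As a sanity check on the simplification, the Bayes expression can be evaluated with exact fractions and compared against the closed form. This is only a sketch: the function name `posterior_all_white` and the test values of $p$ and $q$ are assumptions made for illustration.

```python
from fractions import Fraction

def posterior_all_white(p, q):
    """P(A|C) via Bayes' theorem, with P(C|A) = q and P(C|A^c) = q/2."""
    numerator = q * p                        # P(C|A) P(A)
    evidence = p * q + (1 - p) * (q / 2)     # P(C) by total probability
    return numerator / evidence

# The result does not depend on q and matches both closed forms.
for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    for q in (Fraction(1, 4), Fraction(3, 4)):
        assert posterior_all_white(p, q) == 2 * p / (p + 1) == 2 - 2 / (p + 1)

print(posterior_all_white(Fraction(1, 2), Fraction(1, 4)))  # 2/3
```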

### (c) Which is larger? Explain the implication. (5pts)

For the interval $0<p<1$, we get that $2 - \frac{2}{p + 1} = \frac{2p}{p + 1} \gt p$, because $\frac{2}{p + 1} \gt 1$ whenever $p \lt 1$ (this can also be seen graphically by plotting $f(x) = x$ and $g(x) = 2 - \frac{2}{x + 1}$ and observing that $g(x)$ has larger values of $y$ for $0<x<1$).

- $P(A|C) \gt P(A|B)$

The implication is that observing a black panther does not make the hypothesis "all cows are white" more likely and is not equivalent to observing a white cow. Observing a white cow however does make the hypothesis "all cows are white" slightly more likely.
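The conclusion can also be checked empirically. The following Monte Carlo sketch samples the world state and one observation per trial, then estimates both posteriors from the counts. The function name `simulate`, the seed, the trial count, and the values $p = 0.3$, $q = 0.5$ are all arbitrary choices for illustration.

```python
import random

def simulate(p, q, trials=200_000, seed=0):
    """Estimate P(A|B) and P(A|C) by sampling worlds and observations."""
    rng = random.Random(seed)
    n_b = n_ab = n_c = n_ac = 0
    for _ in range(trials):
        a = rng.random() < p           # True if all cows are white in this world
        is_cow = rng.random() < q      # observation type is independent of A
        if not is_cow:
            # Panthers are always black, so this is event B
            n_b += 1
            n_ab += a
        else:
            # Under A every cow is white; under A^c half of them are
            white = a or rng.random() < 0.5
            if white:
                n_c += 1               # event C: a white cow was observed
                n_ac += a
    return n_ab / n_b, n_ac / n_c

p_given_b, p_given_c = simulate(p=0.3, q=0.5)
print(round(p_given_b, 2), round(p_given_c, 2))
```

With these inputs the first estimate should land near the prior $p = 0.3$ and the second near $\frac{2p}{p+1} \approx 0.46$, up to sampling noise.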

---
## Problem 2:  log odds ratios (35 points)
This exercise is an exploratory analysis of the Sentiment140 dataset. Sentiment140 combines 1.6M tweets collected via the Twitter API with most of the emoticons removed. Each tweet is annotated with polarity: positive (4), negative (0) and neutral (2).  _We will  **not** consider neutral tweets in this problem_. You do not have to check the original paper that proposed this dataset, but if you are curious, here is the link: [https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf](https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf).

In this problem, we will analyze how often a word tends to appear with a positive sentiment vs. a negative one. The metric we are going to use is the **log odds ratio**, which compares the conditional probability of a word occurring in one type of sentence, say, positive ($P(word|pos)$), with that of the word occurring in another type of sentence, say, negative ($P(word|neg)$):
$$log\_odds\_ratio(word, pos) = \log\frac{P(word|pos)}{P(word|neg)}$$
The higher the $log\_odds\_ratio$, the more likely the word is associated with positive sentences.
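To make the metric concrete, here is a small sketch on made-up counts (the dictionaries `pos_counts` and `neg_counts` and the helper `toy_log_odds` are hypothetical and not part of the assignment data). It uses the natural logarithm.

```python
from math import log

# Hypothetical word counts: 10 tokens in each class
pos_counts = {"love": 6, "day": 3, "bad": 1}
neg_counts = {"love": 1, "day": 3, "bad": 6}

def toy_log_odds(word):
    """log( P(word|pos) / P(word|neg) ) on the toy counts above."""
    p_pos = pos_counts[word] / sum(pos_counts.values())
    p_neg = neg_counts[word] / sum(neg_counts.values())
    return log(p_pos / p_neg)

for w in ("love", "day", "bad"):
    print(w, round(toy_log_odds(w), 3))
# love 1.792   (log 6: six times more likely in positive tweets)
# day 0.0      (equally likely in both classes)
# bad -1.792   (log 1/6: six times more likely in negative tweets)
```

A positive value means the word leans positive, a negative value means it leans negative, and zero means no preference.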


Download the file ``sentiment140_sample1.csv`` (a 10K sample from the training set of Sentiment140) from Canvas and put it in the  **same directory** (folder) as your Python script or notebook file. As a reminder, the file has six fields: polarity, tweet ID, date, query, username, and the text of the tweet. We will only use polarity and tweet text in this assignment.

In the following exercises, we have provided several expected inputs and outputs of the functions that you will implement. Treat these as test cases for your code; if you get numbers very far off from what is listed here with the same input, you have bugs to crush.

In [1]:
# Read in the data (DO NOT CHANGE)
import pandas
sentiment_data = pandas.read_csv("../datasets/sentiment140_sample1.csv", header = None, encoding = "ISO-8859-1")
sentiment_data.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467817374 | Mon Apr 06 22:21:30 PDT 2009 | NO_QUERY | ajaxpro | @MissXu sorry! bed time came here (GMT+1) ht... |
| 1 | 0 | 1467863716 | Mon Apr 06 22:33:35 PDT 2009 | NO_QUERY | stacyc37 | sad that the 'feet' of my macbook just fell off |
| 2 | 0 | 1467878057 | Mon Apr 06 22:37:26 PDT 2009 | NO_QUERY | debbieseraphina | help me forget 8th april &amp; 13th july! |
| 3 | 0 | 1467909292 | Mon Apr 06 22:45:54 PDT 2009 | NO_QUERY | satori | @soillodge yes, it will be. it's only Monday |
| 4 | 0 | 1468045043 | Mon Apr 06 23:25:27 PDT 2009 | NO_QUERY | TigerHasse | Debbugging old VB6 code, the day could have st... |


In [2]:
sentiment_data.describe()

| | 0 | 1 |
|---|---|---|
| count | 10000.0 | 10000.0 |
| mean | 2.0092 | 1997586000.0 |
| std | 2.000079 | 195183300.0 |
| min | 0.0 | 1467817000.0 |
| 25% | 0.0 | 1956556000.0 |
| 50% | 4.0 | 2002018000.0 |
| 75% | 4.0 | 2176924000.0 |
| max | 4.0 | 2329177000.0 |


### (a) Frequency counts  (5 points)
First, let's create dictionaries that record the count of each word in positive tweets, as well as the count of each word in negative tweets. Here, ``counts["pos"]`` will contain key-value pairs of a word and its number of appearances in positive tweets, and ``counts["neg"]`` will contain key-value pairs of a word and its number of appearances in negative tweets.

To parse the tweets, we will use NLTK's ``word_tokenize()`` function. As an example, the following tokenizes a sentence into a list of words:

In [3]:
import nltk
nltk.download('punkt') #you only have to do this once per environment

from nltk import word_tokenize
word_tokenize("This is a sentence.")

[nltk_data] Downloading package punkt to /home/jboy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['This', 'is', 'a', 'sentence', '.']

Lower-casing all words gives cleaner counts. For example, consider the two sentences: "Apples are delicious. John loves apples." If we do not lower-case each word, "Apples" and "apples" will be counted as two different words. In Python, you can lower-case a word by calling ``lower()``:

In [4]:
print("Apples" == "apples")
print("Apples becomes", "Apples".lower())
print("Apples".lower() == "apples")

False
Apples becomes apples
True


We will only consider words and not symbols or numbers. To test whether a token is a word, that is, whether it consists only of alphabetic characters, we can use ``isalpha()``:

In [5]:
print("Apples".isalpha())
print("Apples123".isalpha())

True
False
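The two string methods can be combined into a single filtering step. The sketch below uses a hand-written token list as a stand-in for the output of ``word_tokenize()``, so the list itself is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical tokens, in the style of word_tokenize() output
tokens = ["Apples", "are", "delicious", ".", "John", "bought", "12", "apples", "!"]

# Keep alphabetic tokens only, lower-cased
words = [t.lower() for t in tokens if t.isalpha()]
print(words)  # ['apples', 'are', 'delicious', 'john', 'bought', 'apples']

# "Apples" and "apples" now share one entry
print(Counter(words)["apples"])  # 2
```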


**Complete the code below**

In [6]:
from collections import defaultdict

def get_counts(data):
    """ 
    counts the number of times a word appears in negative or positive tweets
    
    Parameters:
    data: Pandas dataframe of tweets
    
    Returns:
    counts: Dictionary of counts, which includes the dictionaries 'pos' and 'neg'
    
    """
    
    counts = defaultdict(lambda: defaultdict(int))
    
    for _, row in data.iterrows():
        sentiment = row[0]  # polarity label: 4 = positive, 0 = negative
        tweet = row[5]      # raw tweet text
        
        for word in word_tokenize(tweet):
            # Count alphabetic tokens only, ignoring case
            if word.isalpha():
                counts[sentiment][word.lower()] += 1
    
    return {"pos": dict(counts[4]), "neg": dict(counts[0])} # positive (4), negative (0)


In [7]:
# Do not change
counts = get_counts(sentiment_data)
print(counts["pos"]["happy"]) # should print 124 or 127
print(counts["neg"]["happy"]) # should print 38
print(counts["pos"]["hate"]) # should print 15
print(counts["neg"]["hate"]) # should print 99

127
38
15
99


### (b) Calculating $P(\text{word}|\text{polarity})$ (10 points)

Create a function ``get_word_prob(counts, word, polarity)``, where ``counts`` is a dictionary like in the previous task, ``word`` is the word for which $P(word|polarity)$ will be calculated, and ``polarity`` is either ``pos`` or ``neg``. The function should return $P(word|polarity)$. If ``counts[polarity]`` does not contain ``word``, then return 0.

Note that you should NOT need to use the variable ``data`` here, and only rely on the three arguments of the function: ``counts, word, polarity``.


In [8]:
def get_word_prob(counts, word, polarity):
    """ 
    calculates the probability of a word given a polarity 
    
    Parameters:
    counts (dict): the dictionaries 'pos' and 'neg' which count word occurrences
    word (str): the word you want to get the probability for
    polarity (str): either 'pos' or 'neg'
    
    Returns:
    probability (float):  the probability of a word given a polarity 
    
    """
    # Your code goes here
    
    # Total number of counted word tokens with this polarity
    universe_size = sum(counts[polarity].values())
    
    # Words never seen with this polarity have a count of 0
    event_size = counts[polarity].get(word, 0)
    
    probability = event_size / universe_size
    
    return probability # P(word|polarity)




In [9]:
#Do not change
print(get_word_prob(counts, "great", "pos")) # should be ~0.00254
print(get_word_prob(counts, "glad", "neg")) # should be ~0.000121
print(get_word_prob(counts, "wugs", "neg")) # should be 0


0.002541681986028742
0.00012071280913795966
0.0
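One quick way to sanity-check a function like this is to confirm that the probabilities of all words within a polarity sum to 1. The sketch below does so on made-up counts; `toy_counts` and the stand-alone helper `word_prob` are hypothetical and only mirror the relative-frequency estimate described above.

```python
from math import isclose

# Hypothetical counts in the same nested-dictionary layout
toy_counts = {
    "pos": {"happy": 3, "great": 2, "day": 5},
    "neg": {"sad": 4, "day": 4},
}

def word_prob(counts, word, polarity):
    """Relative frequency of `word` among all tokens of `polarity`."""
    total = sum(counts[polarity].values())
    return counts[polarity].get(word, 0) / total

# Within each polarity, the word probabilities form a distribution.
for polarity, table in toy_counts.items():
    total_prob = sum(word_prob(toy_counts, w, polarity) for w in table)
    assert isclose(total_prob, 1.0)

print(word_prob(toy_counts, "happy", "pos"))  # 0.3
print(word_prob(toy_counts, "happy", "neg"))  # 0.0
```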


### (c) Calculate the log odds ratio of a word  (10 points)


Using the above function, we can calculate $P(word|pos)$ and $P(word|neg)$ given a word, so we are ready to calculate the log odds ratio of that word as well. Create a function ``log_odds_ratio(counts, word, polarity)``, where the arguments are of the same type/format as in the previous problem. The function should return $log\_odds\_ratio(word, polarity)$:

$$ log\_odds\_ratio(word, polarity) = \log\frac{P(word|polarity)}{P(word|opposite\_polarity)} $$

If the denominator is zero, return a very large number (e.g., 10000). Again, you should NOT need to use the variable ``data`` here, and only rely on the three arguments of the function: ``counts``, ``word``, and ``polarity``.

In [10]:
from math import log

def log_odds_ratio(counts, word, polarity):
    """ 
    This function returns the log odds ratio of a term (see previous cell)
    
    Parameters:
    counts (dict): the dictionaries 'pos' and 'neg' which count word occurrences
    word (str): the word you want to get the log odds ratio for
    polarity (str): either 'pos' or 'neg'
    
    Returns:
    ratio (float): log( P(word|polarity) / P(word|opposite_polarity) )
    
    """
    # Your code goes here
    opposite_polarity = {'pos': 'neg', 'neg': 'pos'}
    
    p_wp = get_word_prob(counts, word, polarity)
    p_wop = get_word_prob(counts, word, opposite_polarity[polarity])
    
    try:
        ratio = log(p_wp / p_wop)
    except ZeroDivisionError:
        # denominator is 0: the word never appears with the opposite polarity
        ratio = 10000
    except ValueError:
        # numerator is 0: log(0) is undefined, so fall back to 0
        ratio = 0
        
    return ratio



In [11]:
# Do not change
print(log_odds_ratio(counts, "great", "pos")) # should be ~1.276
print(log_odds_ratio(counts, "the", "neg")) #  should be ~-0.0906
print(log_odds_ratio(counts, "wug", "neg")) # should be a very large number

1.2764610338152973
-0.0905923001499596
10000
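The edge cases can also be isolated in a small helper that works directly on two probabilities. This is a sketch, not the required solution: the name `safe_log_ratio` and its `large` parameter are hypothetical, and returning `-large` for a zero numerator is an assumption of this sketch, since the assignment only specifies the zero-denominator case.

```python
from math import log

def safe_log_ratio(p_num, p_den, large=10000):
    """log(p_num / p_den) with sentinel values for zero probabilities."""
    if p_den == 0:
        # Specified by the assignment: return a very large number
        return large
    if p_num == 0:
        # Assumption of this sketch: mirror the sentinel on the negative side
        return -large
    return log(p_num / p_den)

print(safe_log_ratio(0.002, 0.0))    # 10000
print(safe_log_ratio(0.0, 0.002))    # -10000
print(safe_log_ratio(0.002, 0.002))  # 0.0
```

Checking the inputs up front avoids relying on exceptions and makes each case explicit.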


### (d) Sorting log odds ratios (10 points)

Now that we can calculate log odds ratios for individual words, we can sort words according to their association with a polarity class, say, positive. Create a function ``sort_pos_words(data)`` that takes in the entire dataframe as an argument and returns a sorted list of ``(word, log odds ratio)`` tuples for the positive sentiment class.

If you implement this without filtering out any words, you will notice that there are many cases where the conditional probability of the denominator is 0, leading to the very large number you specified in the ``log_odds_ratio()`` function. This is because most words appear only once in the dataset. One way to mitigate this issue is to consider only words that appeared at least $x$ times in the dataset; here, let's only include words that appeared more than 10 times in the dataset, regardless of the polarity of the tweet (positive or negative).
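The frequency filter described above can be sketched on made-up counts as follows. The dictionary `toy_counts` and the helper `frequent_words` are hypothetical. The threshold is applied here as "at least `min_count`"; replace `>=` with `>` for a strict "more than" reading.

```python
from collections import Counter

# Hypothetical counts in the same nested-dictionary layout
toy_counts = {
    "pos": {"good": 8, "day": 6, "yay": 2},
    "neg": {"good": 3, "day": 7, "ugh": 12},
}

def frequent_words(counts, min_count):
    """Words whose total count across both polarities reaches min_count."""
    totals = Counter(counts["pos"]) + Counter(counts["neg"])
    return {word for word, n in totals.items() if n >= min_count}

print(sorted(frequent_words(toy_counts, min_count=10)))  # ['day', 'good', 'ugh']
```

Note that "ugh" passes the filter even though it never appears in a positive tweet, so filtering reduces the number of zero-probability cases but does not eliminate them.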

Use your function to print out the top 10 most positive:

`` [('proud', 10000), ('congratulations', 10000), ('vip', 2.7657670338765064), ('yum', 2.696774162389555), ('worry', 2.455612105572667), ('mothers', 2.455612105572667), ('thank', 2.393091748591333), ('jonasbrothers', 2.360301925768342), ('sir', 2.360301925768342), ('fabulous', 2.360301925768342)]``
       
 and the top 10  most negative:
 
`` [('expensive', -2.508004655425329), ('bus', -2.5821126275790505), ('throat', -2.651105499066002), ('hates', -2.651105499066002), ('tummy', -2.715644020203573), ('sad', -2.8052561788932606), ('missing', -2.987577735687215), ('died', -3.121109128311738), ('headache', -3.4694158225799536), ('hurts', -3.8348755960744185)]``

In [12]:
def sort_pos_words(data, word_appearance_min=10):
    """
    takes in a pandas dataframe and returns all sufficiently frequent words,
    sorted from most positive to most negative
    
    Parameters:
    data (pandas.DataFrame): the tweets in a dataframe
    word_appearance_min (int): minimum total count (positive + negative) for a word to be included
    
    Return:
    sorted_list (list): a sorted list of (word, log odds ratio) tuples for the positive sentiment class
    
    """
    # Your code goes here
    counts = get_counts(data)
    
    # Total count of each word across both polarities
    words = defaultdict(int)
    for polarity in ('pos', 'neg'):
        for word, count in counts[polarity].items():
            words[word] += count

    # Keep only the popular words (those that appear at least word_appearance_min times)
    popular_words = {word for word, count in words.items() if count >= word_appearance_min}
    
    # Compute the positive log odds ratio of each popular word
    pos_log_odds_ratios = [(word, log_odds_ratio(counts, word, 'pos')) for word in popular_words]
    
    # Sort the list by decreasing log odds ratio
    pos_log_odds_ratios.sort(reverse=True, key=lambda x: x[1])
    
    return pos_log_odds_ratios

In [13]:
# Do not change
lst = sort_pos_words(sentiment_data)
print("Top 10 most positive \n", lst[:10]) # see previous cell for what this should print
print("Top 10 most negative \n", lst[-10:])    

Top 10 most positive 
 [('congratulations', 10000), ('proud', 10000), ('vip', 2.7657546344073345), ('yum', 2.696761762920383), ('mothers', 2.455599706103495), ('worry', 2.455599706103495), ('thank', 2.393079349122161), ('sir', 2.36028952629917), ('jonasbrothers', 2.36028952629917), ('fabulous', 2.36028952629917)]
Top 10 most negative 
 [('expensive', -2.5072449241564128), ('bus', -2.5813528963101344), ('hates', -2.6503457677970856), ('throat', -2.714884288934657), ('tummy', -2.714884288934657), ('sad', -2.7373571447867153), ('missing', -2.9868180044182986), ('died', -3.1203493970428213), ('headache', -3.4976436281842895), ('hurts', -3.8741211994192013)]
