# COMS W1002 Computing in Context: Computing in the humanities
## Project 1: Tweets
## Due November 12th at 11:59PM  

In this project you will develop tools for performing sentiment analysis on a database of tweets from across the country. When the project is complete you should be able to estimate the sentiment of tweets filtered by content.

There are 4 files provided here: http://www.cs.columbia.edu/~cannon/tweet_data/

1. `all_tweets.txt`    is the large collection of tweets 
2. `some_tweets.txt`   is a subset of all_tweets that's more manageable to prototype on
3. `sentiments.csv`    is a CSV file of word sentiment values
4. `zips.csv` (not required, see below)

We will go over the format of each of these files in class. 

**Tweets:**
We will represent a tweet using a Python dictionary with the following entries: 
* text: a string, the text of the tweet all in lowercase
* time: a datetime object, date and time of the tweet
* latitude: a float, the latitude of the tweet's location
* longitude: a float, the longitude of the tweet's location
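
As a quick illustration of this representation, here is one such dictionary. The values are invented for the example and are not taken from the data files.

```python
import datetime

# A hypothetical tweet, made up for illustration (not from all_tweets.txt)
example_tweet = {
    'text': 'good morning new york',                       # lowercase string
    'time': datetime.datetime(2011, 8, 28, 19, 2, 28),     # datetime object
    'latitude': 40.71,                                     # float
    'longitude': -74.01,                                   # float
}

print(example_tweet['time'].year)
```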


**Problem 1** Write a function called `make_tweets` that takes as input a file name and returns a list of dictionaries. Each dictionary corresponds to a tweet.

In [11]:
### worked with Antonio Cerros and Celena Dong

def make_tweets(file):
    import datetime
    dics = []
    
    with open(file, 'r', encoding = 'utf8') as f:
        line = f.readline()
        while line != '': 
                
            l = line.split('\t')
            
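            # a line with fewer than 4 tab-separated fields is not a new tweet;
            # a line with no tabs continues the text of the previous tweet
            while len(l) < 4 and line != '':
                if len(l) < 2 and dics:
                    text = text + ' ' + line.rstrip('\n').lower()
                    dic['text'] = text
                line = f.readline()
                l = line.split('\t')

            # stop if the file ended while skipping continuation lines
            if line == '':
                break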
            
            # find the latitude and longitude
            sub_list = l[0].rstrip("]")
            sub_list = sub_list.lstrip("[")
            sub_list = sub_list.split(",")

            latitude = float(sub_list[0])
            longitude = float((sub_list[1]).lstrip(" "))
            
            # find the datetime
            sub_list2 = l[2].split(" ")
            dates = sub_list2[0].split('-')
            times = sub_list2[1].split(':')
            date = datetime.datetime(int(dates[0]), int(dates[1]), int(dates[2]), int(times[0]), int(times[1]), int(times[2]))

            # find the text
            text = l[3].rstrip('\n').lower()
            
            # make dic
            dic = {'text': text, 'time': date, 'latitude': latitude, 'longitude': longitude}
            
            # append dic to dics
            dics.append(dic)
            
            # read the next line 
            line = f.readline()
            
    return dics
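
The per-line parsing in `make_tweets` can also be sketched with `datetime.datetime.strptime`, which reads the timestamp in one call. This is only a sketch: `parse_line` is a hypothetical helper, and the line layout (coordinates in brackets, an unused second field, a `YYYY-MM-DD HH:MM:SS` timestamp, then the text) is an assumption inferred from the indices used in the code above.

```python
import datetime

def parse_line(line):
    """Parse one complete tweet line into a dictionary (hypothetical helper)."""
    fields = line.rstrip('\n').split('\t')
    # Assumed layout: "[lat, lon]", an unused field, "YYYY-MM-DD HH:MM:SS", text
    lat_str, lon_str = fields[0].strip('[]').split(',')
    when = datetime.datetime.strptime(fields[2], '%Y-%m-%d %H:%M:%S')
    return {
        'text': fields[3].lower(),
        'time': when,
        'latitude': float(lat_str),
        'longitude': float(lon_str),   # float() ignores the leading space
    }

# Made-up sample line in the assumed layout
sample = '[40.71, -74.01]\t6\t2011-08-28 19:02:28\tGood Morning New York\n'
tweet = parse_line(sample)
print(tweet)
```

The sketch handles only complete four-field lines; tweets whose text spans several lines still need the continuation handling shown in `make_tweets`.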
            

**Problem 2** Write a function `add_sentiment` to determine the sentiment of each tweet by taking the average sentiment over all of the words in the tweet. The function should return a new list of tweets where each tweet has a new key '`sentiment`' whose value is either a number between -1 and 1 representing the sentiment of the tweet, or *None* if no word in the tweet has a sentiment. Note: words without a sentiment do not have sentiment 0. Your function should take as input a list of tweets (dictionaries) together with the name of the sentiment file. Be careful that your function does not alter the original list (no side effects!).

In [13]:
### worked with Antonio Cerros
def add_sentiment(tweets,filename):
    import csv
    
    # new list
    new_tweets = []
    puncs = '.,"!:;/?()&*'
    
    # take each tweet
    for tweet in tweets:
            
        words = tweet['text']
        total_sentiments = 0
        count = 0  
        
        # make sure every word is lowercase and without punctuation
        words = words.lower()
        words = ''.join(ch for ch in words if ch not in puncs)
              
   
        # open sentiment file
        with open(filename, 'r', encoding = 'utf8') as f:
            sent_file = csv.reader(f)
            word_list = words.split()
            
            for row in sent_file:
                sent_word = row[0]
                
                # a phrase is matched as a substring of the tweet text;
                # a single word must match a whole word of the tweet
                if len(sent_word.split()) > 1:
                    answer = sent_word in words
                else:
                    answer = sent_word in word_list
            
                if answer:
                   
                    # the word is in the tweet, so add its value from the sentiment file
                    total_sentiments += float(row[1])
                    count += 1
                            
        # create new dictionary with all the info from the previous dictionary, plus the sentiment floating point value
        dic = tweet.copy()
            
        # if none of the words had a sentiment, the sentiment is None
        if count == 0:
            dic['sentiment'] = None
        else:
            
            # take the average sentiment of all the words that have a sentiment
            dic['sentiment'] = total_sentiments / count
            
        # add dictionary to list
        new_tweets.append(dic)
            
    return new_tweets
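
`add_sentiment` above reopens the sentiment file for every tweet. One alternative, sketched here under stated assumptions, is to read the file once into a dictionary and look each word up in it. `load_sentiments` and `average_sentiment` are hypothetical helper names, the sample values are made up, and the sketch assumes each row is `word,value` with no header row. Unlike the function above, it ignores multi-word phrases and counts a repeated word each time it appears.

```python
import csv
import io

def load_sentiments(f):
    """Read (word, value) rows from an open csv file into a dictionary."""
    return {row[0]: float(row[1]) for row in csv.reader(f) if row}

def average_sentiment(text, sentiments):
    """Average the values of the words in text that have a sentiment, else None."""
    values = [sentiments[w] for w in text.split() if w in sentiments]
    return sum(values) / len(values) if values else None

# Made-up sentiment values standing in for sentiments.csv
fake_file = io.StringIO('good,0.875\nbad,-0.625\n')
sentiments = load_sentiments(fake_file)

print(average_sentiment('a good day after a bad night', sentiments))  # 0.125
print(average_sentiment('no scored words here', sentiments))          # None
```

Words missing from the dictionary are skipped rather than counted as 0, which matches the note in Problem 2.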

**Problem 3**  Write a function called `tweet_filter` that will return a new list of tweets filtered by the content of the tweet text. The input for this function should be a list of tweets and a list of words (strings). The function should return a list of tweets that each include *all* of the words in the word list ignoring case and punctuation. Note: Since you are not changing the tweets, as long as the returned list is new, you don't have to worry about side-effects on the tweets here.



In [15]:
def tweet_filter(tweets, words):
    
    # empty list
    filtered = []
    puncs = set('.,"!:;/?()&*')
    
    # get each tweet
    for tweet in tweets:
        answer = True
        
        # remove all punctuation
        text = tweet['text']
        text = text.lower()
        text = ''.join(ch for ch in text if ch not in puncs)
            
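        # check to see if the words are in the tweet
        tweet_text = text.split()
        for word in words:
            # normalize the query word the same way as the tweet text
            word = ''.join(ch for ch in word.lower() if ch not in puncs)
            if len(word.split()) > 1:
                # a phrase is matched as a substring of the tweet text
                if word not in text:
                    answer = False
            else:
                if word not in tweet_text:
                    answer = False
                
        # if every word was found, add the tweet to the list
        if answer:
            filtered.append(tweet)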
                        
    return filtered
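
For single-word queries, the "contains all of the words" test can also be written as a subset check between two sets. The sketch below is one possible formulation, not the approach used above: `contains_all` is a hypothetical helper, the sample tweets are made up, and it strips everything in `string.punctuation`, which is a broader set than `puncs` (it also removes characters such as `#`, `@` and apostrophes).

```python
import string

def contains_all(text, words):
    """Return True if every word in words appears as a whole word in text."""
    # Remove ASCII punctuation and lowercase before splitting into words
    table = str.maketrans('', '', string.punctuation)
    text_words = set(text.lower().translate(table).split())
    wanted = {w.lower().translate(table) for w in words}
    return wanted <= text_words   # subset test

# Made-up tweets for illustration
tweets = [{'text': 'Coffee, then work!'}, {'text': 'no work today'}]
matches = [t for t in tweets if contains_all(t['text'], ['coffee', 'WORK'])]
print(matches)
```

Because the list comprehension builds a new list and never modifies the tweet dictionaries, it has no side effects on the input.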

**Problem 4** Use your work above and below to answer the following questions:
1. What is the average sentiment of tweets containing the word 'beer'?
2. What is the average sentiment of tweets containing the word 'coffee'?
3. Consider the average sentiment of the tweets containing the words 'beer', 'movie', 'coffee', and 'work'. Which word leads to a list of tweets with the lowest average sentiment?

In [28]:
# Include the code you wrote for Problem 4 here

# create a function to find the average sentiment
def avg_sentiment(tweet_file, sent_file, words):
    sentiment = 0
    tweet_list = make_tweets(tweet_file)
    
    # get a list of all tweets with the words in them
    sub_list = tweet_filter(tweet_list, words)
    
    # including the sentiments
    sub_list = add_sentiment(sub_list, sent_file)
    
    # get the sum, skipping tweets that have no sentiment
    for tweet in sub_list:
        if tweet['sentiment'] is not None:
            sentiment += tweet['sentiment']
    
    if len(sub_list) != 0:
        
        # return the average over all filtered tweets
        # (tweets with no sentiment still count in the denominator)
        return sentiment / len(sub_list)
        
    else:
        return None
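
`avg_sentiment` divides the sum by the number of filtered tweets, so tweets whose sentiment is `None` still count in the denominator. Another reasonable reading of "average sentiment" is to average only the tweets that have a sentiment. A minimal sketch of that variant follows; `mean_sentiment` is a hypothetical helper and the sample values are made up. It would generally give different numbers from the function above.

```python
def mean_sentiment(tweets):
    """Average the 'sentiment' values that are not None; None if there are none."""
    scored = [t['sentiment'] for t in tweets if t['sentiment'] is not None]
    return sum(scored) / len(scored) if scored else None

# Made-up tweets: two with a sentiment, one without
sample = [{'sentiment': 0.5}, {'sentiment': None}, {'sentiment': -0.25}]
print(mean_sentiment(sample))  # 0.125
```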
        
    


In [27]:
print(avg_sentiment('all_tweets.txt', 'sentiments.csv', ['beer']))
print(avg_sentiment('all_tweets.txt', 'sentiments.csv', ['coffee']))
print(avg_sentiment('all_tweets.txt', 'sentiments.csv', ['movie']))
print(avg_sentiment('all_tweets.txt', 'sentiments.csv', ['work']))

2220
0.0387132891820392
4571
0.06249799208606364
3084
0.07889053203425771
17480
-0.01940605305465581


Write the answers to problem 4 here:

1. Average sentiment: 0.0387133
2. Average sentiment: 0.0624980
3. Lowest average sentiment: 'work'


### For the aspiring hacker: *(not for credit)*

Notice you have geographical information here. How can you use the longitude and latitude information together with the zips.csv file to allow for queries that are filtered by location as well as message content? Implement such a mechanism.
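
One possible starting point, sketched under loud assumptions: if each zip code can be mapped to a latitude and longitude, tweets can be filtered by great-circle distance from that point using the haversine formula. The layout of `zips.csv` is not described here, so the sketch uses a hypothetical in-memory dictionary `zip_centers` with made-up coordinates instead of reading the file. `haversine_km` and `tweets_near` are hypothetical helper names.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def tweets_near(tweets, lat, lon, max_km):
    """Return a new list of the tweets within max_km of the point (lat, lon)."""
    return [t for t in tweets
            if haversine_km(t['latitude'], t['longitude'], lat, lon) <= max_km]

# Hypothetical zip-to-coordinates mapping and made-up tweets
zip_centers = {'10027': (40.81, -73.95)}
tweets = [
    {'text': 'uptown', 'latitude': 40.80, 'longitude': -73.96},
    {'text': 'la', 'latitude': 34.05, 'longitude': -118.24},
]
lat, lon = zip_centers['10027']
nearby = tweets_near(tweets, lat, lon, 10)
print(nearby)
```

The result of `tweets_near` is an ordinary list of tweet dictionaries, so it can be passed on to `tweet_filter` and `add_sentiment` to combine location and content filters.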