# Python Functions, Files, and Dictionaries PROJECT

by the University of Michigan via Coursera

## Part 1: Sentiment Classifier

We have provided some synthetic (fake, semi-randomly generated) Twitter data in a CSV file named `project_twitter_data.csv`, which contains the text of each tweet, the number of retweets of that tweet, and the number of replies to that tweet. We have also provided words that express positive and negative sentiment in the files `positive_words.txt` and `negative_words.txt`.

Your task is to build a sentiment classifier, which will detect how positive or negative each tweet is.

You will create a csv file, which contains columns for the `Number of Retweets`, `Number of Replies`, `Positive Score` (which is how many happy words are in the tweet), `Negative Score` (which is how many angry words are in the tweet), and the `Net Score` for each tweet.

At the end, you will upload the CSV file to Excel or Google Sheets and produce a graph of the Net Score vs Number of Retweets.

### Step 1

To start, define a function called `strip_punctuation` which takes one parameter, a string which represents a word, and removes characters considered punctuation from everywhere in the word. (Hint: remember the **.replace()** method for strings.)

In [17]:
punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']

def strip_punctuation(word):
    # remove any punctuation from the word
    # use the .replace() method
    # NOTE: replace() does not modify the original word
    for punct in punctuation_chars:
        word = word.replace(punct, '')
    return word

print(strip_punctuation('!Amazing#'))

Amazing
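
A few more checks, as a minimal sketch, make the behavior of `.replace()` concrete: because every occurrence of each character is removed, punctuation inside a word disappears too, while characters that are not in `punctuation_chars` (such as the hyphen) are left alone. The sample words below are illustrative and not part of the assignment.

```python
punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']

def strip_punctuation(word):
    # replace() returns a new string, so reassign on every pass
    for punct in punctuation_chars:
        word = word.replace(punct, '')
    return word

# Illustrative inputs (not from the assignment data)
samples = ["it's", "#incredible", "snowing!", "well-known"]
cleaned = [strip_punctuation(w) for w in samples]
print(cleaned)  # ['its', 'incredible', 'snowing', 'well-known']
```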


### Step 2

Next, copy in your `strip_punctuation` function and define a function called `get_pos` which takes one parameter, a string which represents one or more sentences, and calculates how many words in the string are considered positive words.

Use the list `positive_words` to determine which words count as positive. The function should return a non-negative integer: the number of occurrences of positive words in the text.

Note that all of the words in `positive_words` are lowercase, so you’ll need to convert all the words in the input string to lowercase as well.

In [32]:
# copy in strip_punctuation function...
def strip_punctuation(word):
    for punct in punctuation_chars:
        word = word.replace(punct, '')
    return word

punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']

# create a list of positive words from positive_words.txt
positive_words = []
with open("files/positive_words.txt") as pos_f: ## REMOVE files/ WHEN PASTING ##
    for lin in pos_f:
        # skip comment lines (starting with ';') and blank lines
        if lin[0] != ';' and lin[0] != '\n':
            positive_words.append(lin.strip())
print(positive_words[:5])
# no pos_f.close() needed: the with block closes the file

['a+', 'abound', 'abounds', 'abundance', 'abundant']
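
If the word list is long, each `in` test against a list scans it element by element, so a set may be a convenient drop-in for the membership checks used later. This is a minimal sketch under stated assumptions: `sample_lines` is a hypothetical stand-in for the lines of `positive_words.txt`, and `positive_set` is a name introduced here for illustration.

```python
# Hypothetical stand-in for the lines of positive_words.txt:
# comment lines start with ';' and blank lines may appear.
sample_lines = ["; header comment\n", "\n", "a+\n", "abound\n", "abounds\n"]

positive_set = set()
for lin in sample_lines:
    # same filter as the list version: skip comments and blank lines
    if lin[0] != ';' and lin[0] != '\n':
        positive_set.add(lin.strip())

print("abound" in positive_set)            # True
print("; header comment" in positive_set)  # False
```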


In [42]:
def get_pos(string):
    count = 0
    # Clean the string by:
    #   1. lowercasing it
    #   2. splitting it into words
    # IMPORTANT: match against the word list rather than counting substrings,
    # because otherwise a word like 'wonderful' would be overcounted
    # ('won', 'wonder', and 'wonderful' are all positive words).
    # For each word:
    #   1. strip punctuation
    #   2. check whether it is in the positive word list
    #   3. if so, increment the count
    string = string.lower()
    word_list = string.split()
    print(word_list)
    for word in word_list:
        word = strip_punctuation(word)
        if word in positive_words:
            count += 1
    return count

In [43]:
foo1 = "what a truly Wonderful day it is today! #incredible" #2
foo2 = "what a truly wonderful day it is today!" #1
foo3 = "the weather is what it is." #0
foo4 = "The weather truely is abnormal - it's october and already snowing!" #0

In [47]:
get_pos(foo4)

['the', 'weather', 'truely', 'is', 'abnormal', '-', "it's", 'october', 'and', 'already', 'snowing!']


0
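
The overcounting problem mentioned in the comments of `get_pos` can be sketched as follows. The three-word lexicon is a hypothetical miniature of `positive_words`, used only to contrast substring counting with whole-word matching.

```python
# Hypothetical mini-lexicon; the real list comes from positive_words.txt.
positive_words = ["won", "wonder", "wonderful"]
text = "what a truly wonderful day"

# Substring counting: every lexicon entry found anywhere in the text counts,
# so 'wonderful' is also matched by 'won' and 'wonder'.
substring_hits = sum(text.count(w) for w in positive_words)

# Word-level counting: only whole words that appear in the lexicon count.
word_hits = sum(1 for w in text.split() if w in positive_words)

print(substring_hits)  # 3
print(word_hits)       # 1
```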

### Step 3

Next, copy in your `strip_punctuation` function and define a function called `get_neg` which takes one parameter, a string which represents one or more sentences, and calculates how many words in the string are considered negative words.

Use the list `negative_words` to determine which words count as negative. The function should return a non-negative integer: the number of occurrences of negative words in the text.

Note that all of the words in `negative_words` are lowercase, so you’ll need to convert all the words in the input string to lowercase as well.

In [48]:
# copy in strip_punctuation function...
def strip_punctuation(word):
    for punct in punctuation_chars:
        word = word.replace(punct, '')
    return word

punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']

# create a list of negative words from negative_words.txt
negative_words = []
with open("files/negative_words.txt") as neg_f: ## REMOVE files/ WHEN PASTING ##
    for lin in neg_f:
        # skip comment lines (starting with ';') and blank lines
        if lin[0] != ';' and lin[0] != '\n':
            negative_words.append(lin.strip())
print(negative_words[:5])
# no neg_f.close() needed: the with block closes the file

def get_neg(string):
    count = 0
    string = string.lower()
    word_list = string.split()
    print(word_list)
    for word in word_list:
        word = strip_punctuation(word)
        if word in negative_words:
            count += 1
    return count

['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable']
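
Since `get_pos` and `get_neg` differ only in the word list they consult, the shared logic could be factored into one helper. This is a sketch, not what the assignment asks for: `count_matches` is a hypothetical name, and the two small sets stand in for the real word lists.

```python
punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']

def strip_punctuation(word):
    for punct in punctuation_chars:
        word = word.replace(punct, '')
    return word

def count_matches(text, vocabulary):
    """Count the words of text that, lowercased and stripped of punctuation, are in vocabulary."""
    return sum(1 for w in text.lower().split() if strip_punctuation(w) in vocabulary)

# Tiny hypothetical lexicons, for illustration only.
pos = {"wonderful", "incredible"}
neg = {"abnormal"}

tweet = "What a truly Wonderful day, if a bit abnormal! #incredible"
print(count_matches(tweet, pos))  # 2
print(count_matches(tweet, neg))  # 1
```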


### Step 4

Finally, copy in your previous functions and write code that opens the file `project_twitter_data.csv` which has the fake generated twitter data (the text of a tweet, the number of retweets of that tweet, and the number of replies to that tweet).

Your task is to build a sentiment classifier, which will detect how positive or negative each tweet is.

Copy the code from the code windows above, and put that in the top of this code window.

Now, you will write code to create a csv file called `resulting_data.csv`, which contains the `Number of Retweets`, `Number of Replies`, `Positive Score` (which is how many happy words are in the tweet), `Negative Score` (which is how many angry words are in the tweet), and the `Net Score` (how positive or negative the text is overall) for each tweet. The file should have those headers in that order.

Remember that there is another component to this project.

You will upload the csv file to Excel or Google Sheets and produce a graph of the Net Score vs Number of Retweets. Check Coursera for that portion of the assignment, if you’re accessing this textbook from Coursera.

In [67]:
# copy in all of my functions...
def strip_punctuation(word):
    for punct in punctuation_chars:
        word = word.replace(punct, '')
    return word

def get_pos(string):
    count = 0
    string = string.lower()
    word_list = string.split()
    for word in word_list:
        word = strip_punctuation(word)
        if word in positive_words:
            count += 1
    return count

def get_neg(string):
    count = 0
    string = string.lower()
    word_list = string.split()
    for word in word_list:
        word = strip_punctuation(word)
        if word in negative_words:
            count += 1
    return count

punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']

# create a list of positive words from positive_words.txt
positive_words = []
with open("files/positive_words.txt") as pos_f: ## REMOVE files/ WHEN PASTING ##
    for lin in pos_f:
        if lin[0] != ';' and lin[0] != '\n':
            positive_words.append(lin.strip())

# create a list of negative words from negative_words.txt
negative_words = []
with open("files/negative_words.txt") as neg_f: ## REMOVE files/ WHEN PASTING ##
    for lin in neg_f:
        if lin[0] != ';' and lin[0] != '\n':
            negative_words.append(lin.strip())

In [68]:
t_data = []
with open('files/project_twitter_data.csv') as twitter_file: ## REMOVE files/ WHEN PASTING ##
    for lin in twitter_file:
        # naive split: assumes no tweet text contains a comma
        t_data.append(lin.strip().split(','))
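
Splitting each line on `,` works only while no tweet text contains a comma. Where that cannot be guaranteed, the standard library's `csv` module handles quoted fields. The sketch below reads from an in-memory sample rather than the real file; the column names match the header used in this project, but the two rows are hypothetical.

```python
import csv
import io

# Hypothetical sample in the same column layout as project_twitter_data.csv.
sample = io.StringIO(
    'tweet_text,retweet_count,reply_count\n'
    '"Great day, truly wonderful!",3,0\n'
    'Nothing special today,1,2\n'
)

# DictReader uses the header row as keys for every following row.
rows = list(csv.DictReader(sample))
print(rows[0]['tweet_text'])     # Great day, truly wonderful!
print(rows[0]['retweet_count'])  # 3 (still a string)
print(len(rows))                 # 2
```

With a real file, `open(..., newline='')` would take the place of `io.StringIO`.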

In [69]:
t_txt_id = t_data[0].index('tweet_text') #0
t_rt_id = t_data[0].index('retweet_count') #1
r_rp_id = t_data[0].index('reply_count') #2

In [71]:
with open('files/resulting_data.csv', 'w') as my_data:
    my_data.write('Number of Retweets,Number of Replies,Positive Score,Negative Score,Net Score\n')
    for data in t_data[1:]:
        retweets = data[t_rt_id]
        replies = data[r_rp_id]
        positive = get_pos(data[t_txt_id])
        negative = get_neg(data[t_txt_id])
        net = positive - negative  # Net Score = Positive Score - Negative Score
        my_data.write('{0},{1},{2},{3},{4}\n'.format(retweets, replies, positive, negative, net))
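
The same output could also be produced with `csv.writer`, which takes care of separators and quoting. This is a sketch under stated assumptions: the `scored` tuples are hypothetical values, and an in-memory buffer stands in for `resulting_data.csv`.

```python
import csv
import io

# Hypothetical per-tweet values: (retweets, replies, positive, negative).
scored = [(3, 0, 2, 0), (1, 2, 0, 1)]

out = io.StringIO()  # stands in for open('resulting_data.csv', 'w', newline='')
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['Number of Retweets', 'Number of Replies',
                 'Positive Score', 'Negative Score', 'Net Score'])
for retweets, replies, positive, negative in scored:
    # Net Score is the positive count minus the negative count
    writer.writerow([retweets, replies, positive, negative, positive - negative])

print(out.getvalue())
```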