# Assignment 06: Text Analysis

### 1. Shakespeare word frequency

#### Make a Python string that contains the text of the Shakespeare play, Macbeth

In [1]:
import requests

In [2]:
import nltk

In [3]:
# obtain Macbeth text from Project Gutenberg
target_url = "https://www.gutenberg.org/files/1533/1533-0.txt"
response = requests.get(target_url)
with open('macbeth.txt','w',encoding='utf-8') as f:
    f.write(response.text)

In [4]:
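# read the saved file back (same encoding it was written with) and lowercase it;
# the with-block closes the file handle automatically
with open('macbeth.txt', 'r', encoding='utf-8') as document:
    text = document.read().lower()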
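
Project Gutenberg files wrap the play in a license header and footer, which is why words such as "project" and "gutenberg" have to be filtered out later. A minimal sketch of trimming that boilerplate before tokenizing is shown below. The helper name `strip_gutenberg_boilerplate` is hypothetical, and the marker prefixes `*** START OF` and `*** END OF` are an assumption about how the downloaded file is delimited; the helper would need to run on the raw text, before `.lower()` is applied.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    lines = raw_text.splitlines()
    # index of the first line that begins with each (assumed) marker prefix
    start = next((i for i, line in enumerate(lines)
                  if line.startswith("*** START OF")), None)
    end = next((i for i, line in enumerate(lines)
                if line.startswith("*** END OF")), None)
    if start is None or end is None or end <= start:
        return raw_text
    return "\n".join(lines[start + 1:end])


# small synthetic sample, not the real file
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK MACBETH ***\n"
    "When shall we three meet again?\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK MACBETH ***\n"
    "License footer"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```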

#### Tokenization

In [5]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [6]:
# tokenize text to sentences
sentences = sent_tokenize(text)

In [7]:
# tokenize sentences to words (flattened into a single list)
words = [w for s in sentences for w in word_tokenize(s)]

#### Stopwords removal

In [8]:
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
extra_sw = ["project", "gutenberg", "gutenberg-tm", "--", "...", "\\", "ï", "»", "¿"]

In [10]:
# a set makes the membership test in the next cell fast
sw = set(punctuation) | set(stopwords.words('english')) | set(extra_sw)

In [11]:
words_no_sw = [w for w in words if w not in sw]

#### Find the top 20 most frequent words in the play

In [12]:
import collections

In [13]:
word_count = collections.Counter(words_no_sw)
word_top_20 = word_count.most_common(20)

In [14]:
print(word_top_20)

[('macbeth', 286), ('macduff', 109), ('lady', 97), ('thou', 88), ('enter', 72), ('shall', 69), ('banquo', 67), ('upon', 61), ('thee', 60), ('malcolm', 58), ('scene', 57), ('yet', 56), ('us', 54), ('ross', 53), ('come', 53), ('witch', 52), ('good', 52), ('thy', 52), ('hath', 51), ('first', 50)]


These words offer some sense of the play. Many of them are character names ('macbeth', 'macduff', 'banquo', 'malcolm', 'ross'), and a word like 'witch' hints at the play's content. Others, such as 'enter' and 'scene', come from stage directions rather than dialogue. However, many of the high-frequency words are general and do not point to this particular play, so it is difficult to tell what the play is about from this list alone. Words like 'thou', 'thy', and 'hath' do indicate that the text is written in Early Modern English.
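
To see how much of the list is taken up by names and stage directions, one option is to filter them out and look at what remains. The sketch below copies the top-20 counts from the output above; the set `names_and_directions` is a hypothetical, hand-picked list and not part of the original analysis.

```python
# top-20 counts copied from the printed output above
word_top_20 = [('macbeth', 286), ('macduff', 109), ('lady', 97), ('thou', 88),
               ('enter', 72), ('shall', 69), ('banquo', 67), ('upon', 61),
               ('thee', 60), ('malcolm', 58), ('scene', 57), ('yet', 56),
               ('us', 54), ('ross', 53), ('come', 53), ('witch', 52),
               ('good', 52), ('thy', 52), ('hath', 51), ('first', 50)]

# assumption: a hand-picked set of character names and stage-direction words
names_and_directions = {"macbeth", "macduff", "lady", "banquo", "malcolm",
                        "ross", "enter", "scene"}

# keep only the words that are not in the hand-picked set
remaining = [(w, c) for w, c in word_top_20 if w not in names_and_directions]
print(len(remaining), "of", len(word_top_20), "words remain")
print(remaining)
```

Most of what remains consists of archaic or general words, which is consistent with the observation that the list says little about the plot.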

### 2. Yelp sentiments

#### Make a Python string that contains 15 [Yelp](https://www.yelp.com/biz/in-n-out-burger-los-angeles-5) reviews of the restaurant, In-N-Out

In [15]:
from nltk.sentiment import vader
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [16]:
# import reviews
# 5-star
r1 = "Dear Westwood In-N-Out, Thank you for being there for me for post-frat parties and late night paper writing breaks during my UCLA days. Thank you for being there for me post-movie premieres during my entertainment days. And thank you for being there for me in my more recent \"calmer and more responsible\" days. I love this In-N-Out, and to pay tribute, I am choosing it to announce to my Yelpiverse that I have permanently removed beef from my diet (I wanted to get a year under my belt before my true confessions). Please don't strip me of my foodie card. ;-) Cows. Can't do it. They're sentient. I will miss you, In-N-Out. Maybe you will consider a plant-based patty in the future. Because, you know, 2022. That would rock my world. Maybe I'll stop by one day for some animal fries, eh? Thanks for the memories. Adieu."
# 5-star
r2 = "Super convenient location located right near UCLA's campus. We came here during lunch time and the line was already long stretching to the bathroom. Make sure you know your order before you get to the cashier. We ordered 2 burgers animal style as well as 2 fries animal style. Do not sleep on animal style! That is the way to go. We also got 2 t-shirts because we wanted to capture our first time at the iconic In-N-Out. I definitely prefer In-N-Out over Shake Shack because of the animal style!"
# 5-star
r3 = "I can safely say that there isn't a single fast food restaurant I have been to more than In-N-Out. Ok maybe McDonalds (I went there a LOT in college) but In-N-Out reigns supreme in quantity and quality. My usual order is a double-double with grilled onions and a Neapolitan shake. I only get fries if I'm with someone else, and if I do, they are usually animal style. The Neapolitan is the best shake period (secret menu!) and the double-double is the perfect meat/burger ratio. I used to get the single cheeseburger, but that was a mistake. Oh and make sure you get chili peppers on the side. It elevates the burger tenfold if you like spice. Ok that's all folks. Now put down your phone and head over here or I will. Also, parking sucks here. Pro tip - there's 2 hour free parking at Target if you can't find a spot. You have to buy something at the store for validation, but it's definitely worth it during the busy hours."
# 5-star
r4 = "I don't care what anyone says, but In-N-Out always hits at any hour. My roommate and I decided to come here at night-- just cause. This particular location also gets pretty crowded, but we went on a random weekday night, so it wasn't too bad. The 4x4 is basically 4 patties and 4 cheese slices, and honestly, it was quite a lot and super heavy. I'd only get a 4x4 to just try it out once. The double double is a much regular, better option. I think what did it for me was the amount of cheese--too much for my liking. Aside from how giant this burger was, it absolutely HIT. Definitely get it with grilled onions, it elevates the burger so much. I enjoyed taking a bite every time because it's so damn good. The sauce also really wraps the burger nicely. Overall, just a wonderful place for burgers at a pretty decently cheap price!"
# 5-star
r5 = "My heart, my soul, my love. I have nothing but 5 star reviews for each restaurant I have visited. As always I got my classic double double add grilled onions and ketchup, animal style fries, and a pink lemonade. Just as I expected, perfect as always. They have adapted greatly to the COVID pandemic with contactless pay and a fast drive through experience."
# 4-star
r6 = "3.5 stars for the long wait and line up and room temperature not well done French fries. I rounded it up rather than down because the staff made up for it. They were all so pleasant and sweet. It was a drive through of a notorious very busy location that I know of so expectations were not high."
# 4-star
r7 = "This is a rather large In-n-Out with ample outdoor and indoor seating. The food is delicious-- I've never had an issue with my order. Staff are friendly and always working hard. 4/5 stars instead of 5 because, as per usual at any In-n-Out, you have to wait a while for your food. Parking is also limited. The drive-thru is always around the corner and can cause some traffic problems. I (luckily) live close enough so I can walk here-- I can't imagine the stress of navigating the In-n-Out traffic here. Overall, I highly recommend this In-n-Out if you are in the mood for a tasty burger or milkshake. Just be ready to wait a while for your food."
# 4-star
r8 = "BEST burger in LA. consistent and delicious. 10/10 recommend. The line can be a little long but well worth it. Prices are amazing too."
# 3-star
r9 = "After years of listening to my Cali and non-Cali friends rave about In-N-Out, I finally decided to give it a go. The pros: It's difficult not to enjoy the pleasant vintage diner feel and the affordable pricing is another plus. The staff at this location were very helpful and recommended I try the burger and fries animal style (not on the menu but is a secret menu item all In-N-Out regulars know about). This location was very efficient even though they had a lot of customers and long lines. The cons: I thought that the food tasted very average. Burgers were not very flavorful but animal style definitely made them taste better. Fries on the other hand were not great and as a french fry lover I was disappointed by how dry and stiff they were- animal style made them bearable. Overall rating: average. I'm glad I tried it but it is definitely overhyped."
# 3-star
r10 = "This is a pretty good location if I don't feel like driving further west. Lines for In-n-Out normally move very quickly which is great, but being in Westwood, I feel it takes a little longer than what I am used to. Maybe it's because I've only been here once, I'm not too familiar with the layout but if I'm craving it enough, I can certainly get used to it. As usual, friendly employees, and you know what you're getting (and what you are in line for)."
# 3-star
r11 = "In-N-Out is tasty but the bread is always an issue for him (gluten sensitivity). However the protein style set up is a special treat and will try it again. I like that they have that option as part of the menu. I wish they would deliver. This location is very busy with tons of UCLA students, which is great but the nearby homeless is an unattractive issue."
# 2-star
r12 = "This is nothing against the business--the food is just as good as any other In-n-Out, but it's never, ever been worth the nightmare of trying to get through the drive thru line. Not only do you have to wait in a line of cars that rounds the block (not an exaggeration), but also you have to deal with the absolute bozos who are in it. People swerving in front of you to cut the line, yelling out their windows at you because they're scared you might jump in front of them--this all happened to me on my last visit. I'm sorry, no 4 dollar burger is worth me risking damage to my car! I feel a little bad leaving a negative review because of the line, but this location is absolutely inhospitable because of it! You'd probably save time driving to the Venice location."
# 2-star
r13 = "I grew up in So Cal and In N Out was always my go to for burgers. We just got burgers here in Keizer and the food was not what I would have  expected of In N Out. The fries were warm at best and hard. I threw 80% of them in the trash. The Double Double had almost no dressing on it and the buns were hard/over toasted? The shake was good and the Coke was good. I feel like I just wasted $18.00. I could have spent less at Five Guys and maybe got better."
# 1-star
r14 = "Let me start out by saying this is a very nice In-N-Out. There are a lot of tables and the big 3-D 'In-N-Out' in the middle of the restaurant is pretty cool. And also I love In-N-Out food. My go-to is a mustard-friend double double with grilled onions and extra lettuce and extra tomato, dressing and peppers on the side. (Skip the animal fries though - the cheese is never melted and it's gross). However I've had terrible service every time here. The last time I visited traumatized me and I'll never go back to this location. The first time I dined in, my order was taking super long and I saw them continuously push it behind the other orders. If I didn't ask about it, I'm sure it would have taken even longer. My order ended up coming out 10min after my friends' did, and I ordered before they did. The second time I went through the drive-through, they again lost our order. The girl at the second window instructed us to pull over to the side of the drive-through lane and wait for someone to bring our food out. Now, it's very cramped. There's technically enough space for 2 cars side by side, and I pulled over all the way to the side as far as I could go, but some drivers were still nervous about driving through the tight space. Most drivers were able to squeeze by, even a large Range Rover was able to, but some drivers were too scared/bad at driving to do it. So I exited my car to try and help direct the bad ones through the space, but I ended up getting *screamed at* angrily by multiple drivers (they all ended up driving through anyway, with my help and the help of the very friendly homeless man hanging out there) and their passengers. Finally the girl in the window waved for me to come grab our food. I grabbed it and best believe I left as quickly as I could. \nIn conclusion I just really didn't need go to through that traumatizing experience, and it's partially due to the existence of rude people in general but also on In-N-Out for putting everyone in such a bad position. Sorry for being so dramatic. I just can't bring myself to return to this In-N-Out."
# 1-star
r15 = "THIS IN N OUT IS MORE EXPENSIVE THAN ANY OTHER IN N OUT. See the pic for proof. Also whoever took my order in the drive thru on 10/24 at 12:45pm was incompetent. And he gave me my burger plain when I simply asked for no onions (where tf is the tomato and lettuce???). GO TO ANY OTHER IN N OUT WHERE THEY DONT EXPLOIT STUDENTS"

#### Use Vader to find the polarity of each review

In [17]:
sia = vader.SentimentIntensityAnalyzer()

In [18]:
# save reviews into a list
reviews = [r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15]

In [19]:
# save scores into a list (one dict of neg/neu/pos/compound per review)
scores = [sia.polarity_scores(r) for r in reviews]

In [20]:
for s in scores:
    print(s)

{'neg': 0.031, 'neu': 0.755, 'pos': 0.214, 'compound': 0.9822}
{'neg': 0.018, 'neu': 0.859, 'pos': 0.123, 'compound': 0.8715}
{'neg': 0.054, 'neu': 0.741, 'pos': 0.205, 'compound': 0.9851}
{'neg': 0.035, 'neu': 0.688, 'pos': 0.277, 'compound': 0.994}
{'neg': 0.026, 'neu': 0.851, 'pos': 0.123, 'compound': 0.7935}
{'neg': 0.076, 'neu': 0.804, 'pos': 0.121, 'compound': 0.5846}
{'neg': 0.05, 'neu': 0.858, 'pos': 0.091, 'compound': 0.7027}
{'neg': 0.0, 'neu': 0.502, 'pos': 0.498, 'compound': 0.942}
{'neg': 0.068, 'neu': 0.728, 'pos': 0.204, 'compound': 0.9756}
{'neg': 0.017, 'neu': 0.826, 'pos': 0.156, 'compound': 0.909}
{'neg': 0.047, 'neu': 0.713, 'pos': 0.24, 'compound': 0.9493}
{'neg': 0.211, 'neu': 0.752, 'pos': 0.037, 'compound': -0.982}
{'neg': 0.079, 'neu': 0.756, 'pos': 0.166, 'compound': 0.8777}
{'neg': 0.076, 'neu': 0.837, 'pos': 0.087, 'compound': 0.5324}
{'neg': 0.088, 'neu': 0.885, 'pos': 0.028, 'compound': -0.6126}


In some cases, Vader's scores track the users' star ratings well; in others they diverge sharply. The first review, for example, has a compound score of about 0.98, which matches the 5 stars the user gave. The 13th review, however, received 2 stars, yet Vader scores it as clearly positive, with a compound score of 0.88. Vader also struggles to represent neutral (3-star) reviews: reviews 9 through 11 all have compound scores above 0.9, far more positive than the ratings suggest. Overall, Vader's scores can be spot on, but they are not always accurate.
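
The comparison can be made concrete by turning both the star ratings and the compound scores into coarse labels and counting agreements. This is a rough sketch, not part of the original assignment: the compound scores are copied from the output above, the ±0.05 cutoff is the conventional threshold suggested for Vader's compound score, and the star mapping (4 to 5 stars positive, 3 neutral, 1 to 2 negative) is an assumption made for illustration.

```python
# star ratings in review order (r1..r15) and compound scores copied from the output above
stars = [5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3, 2, 2, 1, 1]
compound = [0.9822, 0.8715, 0.9851, 0.994, 0.7935, 0.5846, 0.7027, 0.942,
            0.9756, 0.909, 0.9493, -0.982, 0.8777, 0.5324, -0.6126]


def vader_label(score, threshold=0.05):
    """Map a compound score to a label using the conventional +/-0.05 cutoff."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def star_label(n):
    """Assumed mapping: 4-5 stars positive, 3 neutral, 1-2 negative."""
    if n >= 4:
        return "positive"
    if n <= 2:
        return "negative"
    return "neutral"


# True where the Vader label agrees with the star-based label
matches = [vader_label(c) == star_label(s) for s, c in zip(stars, compound)]
print(sum(matches), "of", len(matches), "reviews agree")

# list the 1-based review numbers where the two labels disagree
disagreements = [i + 1 for i, ok in enumerate(matches) if not ok]
print("disagreements:", disagreements)
```

Under these assumptions the labels agree for 10 of the 15 reviews; the disagreements are the three 3-star reviews and two of the low-rated ones (reviews 13 and 14).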

### 3. Movie reviews

#### Make 5 strings that contain reviews (3 sentences each) of the movie, The Grand Budapest Hotel

In [21]:
c1 = "Anderson has created an amazing universe that is like watching an MGM film with a foreign story back in the '30s, such as Idiot's Delight. It goes back to the days of exquisite service, enormous wealth and privilege, and idyllic beauty. It's also hilariously funny, with Fiennes popping off the one liners and the young Zero (Tony Revolon) keeping a serious face throughout."
c2 = "Wes Anderson has directed a whimsical and absurd black comedy with a camp performance from Fiennes that has a startling odour. It was a shame that he was not Oscar nominated. The story can get dark and murderous at times, it has an underlying sadness."
c3 = "It's so nice to have a movie that is knock-down drag-out hilarious. From the beginning foray, telling us about how the hotel came to be, all of its history, and the introduction of the cast of characters, the delightful episodic delivery, it reminded me to some extent of those spectacular comedies of the sixties and seventies: \"The Great Race\" and \"It's a Mad, Mad, Mad, Mad World.\" What transpires is Ralph Fiennes' feast for the camera and a story of great love, pain, and intensity."
c4 = "This is a new high for Wes Anderson. He's filled this with his usual unique visual style and his quirky characters. In addition, he has used it in an exciting thriller with a bit of mystery."
c5 = "THE GRAND BUDAPEST HOTEL is another quirky comedy from director Wes Anderson, a man who can seemingly make no other type of movie. I watched and enjoyed the first half an hour of this film, finding it fresh and inventive; however, the magic began to wear off after this point, and by the end I found it more than a little tiresome. I have a feeling that the director's style would best be suited to short films, not overlong efforts like this."

#### Make 5 strings that contain reviews (3 sentences each) of the movie, Birdman

In [22]:
d1 = "'Birdman' is an exceptionally well made film, with some of the best and cleverest cinematography of the year, some of the cinematography and editing is so dazzling it's enough to take the breath away. The special effects are also tremendous. Directed by Alejandro González Iñárritu (in the first of his deserved director wins, the best being 2015's 'The Revenant), 'Birdman' is one of the best directed films of 2014 too and shows Iñárritu's immense talent as a director, with breath-taking vision, sense of mood and the ability to make the story as gripping as possible."
d2 = "Director Alejandro Gonzales Inarritu has created one of the most original and brilliantly witty satires of actors, Hollywood and Broadway. There's no doubt about it but the director has really created a behind-the-scenes look at a struggling actor but not only given us a glimpse into his career but also all the added drama that comes with trying to do something that you should probably fail at. BIRDMAN is a film unlike anything else I've ever seen and that's something rather hard to do in today's day and age."
d3 = "The concept of long continuous scenes is interesting. It adds to the level of difficulty. It is audacious and makes the audience sit up to pay attention."
d4 = "Norton like Sean Penn shows how effortless he can do comedy even while being intense. The real plus about the film is how you can interpret the film. It is a satire on Hollywood fame (with all its cultural references it will also age quickly) but also a reflective film about life and death."
d5 = "While watching \"Birdman,\" the latest from Alejandro Gonzalez Inarritu (\"Babel,\" \"Amores Perros,\" \"21 Grams\") I felt an admiration and respect for it that never bubbled over into emotional involvement. The acting is superb, and Inarritu chooses to film his story in what appears to be one fluid shot; there are no cuts in the film, and instead master cinematographer Emanuel Lubezki (currently the best one in the business as far as I'm concerned) sends his camera swooping around the rooms and hallways of New York's St. James Theatre, where most of the action of the film is set. So I noticed and admired the formal aspects of the film, but didn't connect with it on any meaningful level."

In [23]:
# make a Python list that contains these 10 strings
movie_reviews = [c1, c2, c3, c4, c5, d1, d2, d3, d4, d5]

#### Analysis

In [24]:
# extra tokens left over from tokenization (duplicates removed)
extrastop = ['``', "''", "'re", "'s", "'ll", "--", "...",
             "n't", 'one', 'would', 'use', 'from', "'m", "'ve"]

In [25]:
myStopWords = list(punctuation) + stopwords.words('english') + extrastop

In [26]:
movie_reviews_no_sw = []
for i in movie_reviews:
    movie_reviews_no_sw.append([w for w in word_tokenize(i.lower()) if w not in myStopWords])

In [27]:
from nltk.stem.porter import PorterStemmer

In [28]:
# create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [29]:
listOfStemmedWords = []
for i in movie_reviews_no_sw:
    listOfStemmedWords.append([p_stemmer.stem(w) for w in i])

In [30]:
!pip install gensim



In [31]:
from gensim import corpora, models
import gensim

In [32]:
dictionary = corpora.Dictionary(listOfStemmedWords)

In [33]:
corpus = [dictionary.doc2bow(text) for text in listOfStemmedWords]

In [34]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics=2, 
                                           id2word=dictionary, 
                                           passes=20)

In [35]:
for i in ldamodel.print_topics(num_topics=2, num_words=20):
    print(i)

(0, '0.022*"film" + 0.013*"director" + 0.010*"anderson" + 0.010*"inarritu" + 0.010*"we" + 0.007*"watch" + 0.007*"creat" + 0.007*"alejandro" + 0.007*"birdman" + 0.007*"someth" + 0.007*"actor" + 0.007*"admir" + 0.007*"quirki" + 0.007*"new" + 0.007*"level" + 0.007*"style" + 0.007*"make" + 0.007*"best" + 0.007*"comedi" + 0.007*"stori"')
(1, '0.022*"film" + 0.015*"also" + 0.015*"mad" + 0.012*"stori" + 0.012*"best" + 0.009*"direct" + 0.009*"fienn" + 0.009*"comedi" + 0.009*"like" + 0.008*"\'birdman" + 0.008*"iñárritu" + 0.008*"cinematographi" + 0.008*"show" + 0.008*"back" + 0.008*"delight" + 0.008*"hilari" + 0.008*"director" + 0.008*"great" + 0.008*"intens" + 0.005*"charact"')


Comment on the words that the model chooses to represent the 2 topics, and whether they match with your split between comedies and dramas

The two topics that the model finds do not correspond to the split between comedy and drama, and they do not separate the two films cleanly. For example, the first topic (topic 0) includes both "birdman" and "anderson," the name of the director of The Grand Budapest Hotel, while the second topic (topic 1) includes both "'birdman" and "fienn." Many of the words are neutral with respect to genre, such as "film," "best," "director," and "actor." Words that do suggest a specific genre are not confined to one topic either: "comedi" appears in both topics, while "hilari" appears only in topic 1.
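
The overlap between the two topics can be checked directly with set operations. The sketch below hard-codes the top-20 stems from the printed topics above rather than querying the model, since LDA results can vary from run to run; the variable names are illustrative.

```python
# top-20 stems per topic, copied from the printed output above
topic_0 = ["film", "director", "anderson", "inarritu", "we", "watch", "creat",
           "alejandro", "birdman", "someth", "actor", "admir", "quirki", "new",
           "level", "style", "make", "best", "comedi", "stori"]
topic_1 = ["film", "also", "mad", "stori", "best", "direct", "fienn", "comedi",
           "like", "'birdman", "iñárritu", "cinematographi", "show", "back",
           "delight", "hilari", "director", "great", "intens", "charact"]

# stems that appear in both topics
shared = sorted(set(topic_0) & set(topic_1))
print("shared:", shared)

# stems unique to each topic
only_0 = sorted(set(topic_0) - set(topic_1))
only_1 = sorted(set(topic_1) - set(topic_0))
print("only in topic 0:", only_0)
print("only in topic 1:", only_1)
```

Five of the twenty stems are shared, including "comedi", and each film's vocabulary ("anderson" and "birdman" in topic 0, "fienn" and "'birdman" in topic 1) shows up on both sides of the split.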