# CS-5313/7313 Project 4

## IMPORTANT - Make a copy of this colab notebook before working on it.

In this project, you will explore various pre-built language models based on the transformer architecture. Transformer networks are a state-of-the-art approach to language and time-series modeling that makes use of a concept called "attention". This design was first proposed in the paper "Attention Is All You Need" by Vaswani et al., which can be read here: https://arxiv.org/pdf/1706.03762.pdf

Here you will use the pre-built transformer pipelines provided by Hugging Face. You can reference this link on how to use the package for the task you are trying to complete: https://huggingface.co/transformers/task_summary.html

In [11]:
# Run me to install the package we will be using

! pip install transformers datasets
from transformers import pipeline



## Task 1 - Text generation

In this task you will be using the "text-generation" pipeline to generate text. 

Use prompts of three different sizes (1 word, ~5 words, and ~10 words) to generate sequences of length n+5 words, n+20 words, and 100 words, where n is the number of words in the prompt you provided. Generate 3 sequences for each prompt and output-length pair. Since this is qualitative, comment in your report on the relative quality of the generated text and include examples. How do the parameters affect the quality of the output? In addition to the report, submit another document containing each of these generated sequences along with its prompt.
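The prompt and length combinations described above can be enumerated with a small helper. The sketch below is only an illustration: `target_lengths` and `truncate_to_words` are hypothetical names, not part of the assignment or of the transformers package. It counts words by splitting on whitespace, which matters because the pipeline's `max_length` argument is measured in tokens rather than words, so trimming the generated text afterwards is one way to reach an exact word count.

```python
# Hypothetical helpers for Task 1 (not part of the transformers API).

def target_lengths(prompt):
    """Return the three target output lengths, in words, for a prompt."""
    n = len(prompt.split())  # number of whitespace-separated words
    return [n + 5, n + 20, 100]


def truncate_to_words(text, num_words):
    """Keep at most `num_words` whitespace-separated words of `text`.

    Runs of whitespace, including newlines, collapse to single spaces.
    """
    return " ".join(text.split()[:num_words])


prompts = ["I", "Why are you such a", "When I grew up, we did not have all these"]
for prompt in prompts:
    print(repr(prompt), target_lengths(prompt))
```

A generated string could then be passed through `truncate_to_words(generated, target)` before it is recorded, so that every saved sequence has the requested number of words.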

In [27]:
# Text generation

text_generator = pipeline("text-generation") 
prompts = ['I', 'Why are you such a', 'When I grew up, we did not have all these']
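for p in prompts:
  n = len(p.split())  # number of words in the prompt
  # Note: max_length and min_length are measured in tokens, not words,
  # so these values only approximate the word counts asked for in the task.
  print(text_generator(p, max_length=n + 5, min_length=n + 5, do_sample=True))
  print(text_generator(p, max_length=n + 20, min_length=n + 20, do_sample=True))
  print(text_generator(p, max_length=100, min_length=100, do_sample=True))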

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Penguins' season"}]



[{'generated_text': 'Penguins vs. Ducks 3pt Shot Rebound\n\n2:04 1st: L'}]



[{'generated_text': 'Penguins 2nd - 2-0 (OT) Pinstripe\n\n6:35 PM EST @ NGCN_FRA 0 9 - 2 2 NGCN_FRA 0\n\n6:05 PM EST @ NGCN_FSF 0 2 - 0 0 NGCN_FSF 1\n\n5:49 PM EST @ NGCN_FSG 0 3 - 0 0 NGCN_FSG 1\n\n5:17 PM'}]



[{'generated_text': 'Why are you such a jerk?\' " her'}]



[{'generated_text': "Why are you such a big fan of the show that you feel you can talk like you're on a quest for real"}]



[{'generated_text': "Why are you such a good reader?\n\nTrying to write a book without anyone giving you an opportunity to review it? It doesn't work with me and I don't try to be objective! You cannot give me $100 for writing it or $70 for your review. No one can read a book like me.\n\nFor writing it, you have to be very good if you are thinking about writing it. I do not want to make you a self-aggrandizing piece"}]



[{'generated_text': 'When I grew up, we did not have all these wonderful things,"'}]



[{'generated_text': 'When I grew up, we did not have all these different rules about how to behave. We wanted to behave responsibly and, as a kid,'}]
[{'generated_text': 'When I grew up, we did not have all these kids together."\n\nThe last time I mentioned these families as the "parents" was in "The Day We Were Giants," "The Day We Were Giants," and "The Final One."\n\n"Now, how do you do it?" you asked.\n\n"If I had to choose one," I answered.\n\nWhen I was young, I knew no one ever told me that. But when I went to see this'}]


## Task 2 - Sentiment Analysis 

Here you will use the "sentiment-analysis" pipeline to examine the sentiment of Amazon reviews. The dataset is provided at this link: https://jmcauley.ucsd.edu/data/amazon/. It is recommended to use one of the smaller datasets, such as the Musical Instruments dataset with 10,261 reviews.

Each review includes both the review text and the reviewer's rating.
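As a rough sketch of pulling those two fields out of a review file, the snippet below assumes the file stores one JSON object per line; that layout is an assumption to verify against the file you actually download. The helper name `read_reviews` and the two sample records are made up for illustration, while the field names `reviewText` and `overall` are the ones used in this notebook.

```python
import io
import json


def read_reviews(lines):
    """Yield (review_text, rating) pairs from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # .get returns None when a field is missing, so callers can filter.
        yield record.get("reviewText"), record.get("overall")


# Two made-up records standing in for a real review file.
sample = io.StringIO(
    '{"reviewText": "Works great.", "overall": 5.0}\n'
    '{"reviewText": "Broke in a week.", "overall": 1.0}\n'
)
reviews = list(read_reviews(sample))
print(reviews)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, since both can be iterated line by line.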

### Subtask 2.1
Perform sentiment analysis on each review, and compare the model's output to the user's rating to get a sense of the model's accuracy. The user's rating, out of a maximum of 5 stars, is found in the "overall" data field. For this, assume that a value of 3 or higher in the "overall" data field is a positive review.
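The comparison rule can be written down as a minimal sketch, assuming the model's labels are the strings `POSITIVE` and `NEGATIVE` (as the default sentiment model in this notebook produces). The function names and the sample pairs are hypothetical and not real review data.

```python
def expected_label(rating, threshold=3.0):
    """Map a star rating to the label counted as correct for it."""
    return "POSITIVE" if rating >= threshold else "NEGATIVE"


def accuracy(pairs, threshold=3.0):
    """Fraction of (rating, predicted_label) pairs the model got right."""
    if not pairs:
        return 0.0
    correct = sum(
        1 for rating, label in pairs
        if expected_label(rating, threshold) == label
    )
    return correct / len(pairs)


# Made-up (rating, predicted_label) pairs for illustration.
sample_pairs = [
    (5.0, "POSITIVE"),
    (4.0, "NEGATIVE"),  # the only mismatch in this sample
    (3.0, "POSITIVE"),
    (1.0, "NEGATIVE"),
]
print(accuracy(sample_pairs))  # 3 of the 4 pairs match
```

Keeping the threshold as a parameter makes it easy to see how the reported accuracy changes if 3-star reviews are treated as negative instead.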

### Subtask 2.2
In addition to looking at the accuracy of the model for each review, also compare the percentage of products with more positive reviews than negative reviews to the true percentage. 
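One possible way to set up this per-product comparison is sketched below. It assumes each review carries a product identifier (in the Amazon data this is expected to be the `asin` field, which should be checked against the downloaded file), and the function name and sample records are hypothetical.

```python
from collections import defaultdict


def share_mostly_positive(records, use_prediction, threshold=3.0):
    """Fraction of products with more positive than negative reviews.

    `records` is an iterable of (product_id, rating, predicted_label).
    With use_prediction=True the model's label decides positivity;
    otherwise the star rating does (rating >= threshold is positive).
    """
    tallies = defaultdict(lambda: [0, 0])  # product_id -> [positive, negative]
    for product_id, rating, predicted_label in records:
        if use_prediction:
            positive = predicted_label == "POSITIVE"
        else:
            positive = rating >= threshold
        tallies[product_id][0 if positive else 1] += 1
    if not tallies:
        return 0.0
    winners = sum(1 for pos, neg in tallies.values() if pos > neg)
    return winners / len(tallies)


# Made-up records: two products with three reviews each.
sample = [
    ("A", 5.0, "POSITIVE"),
    ("A", 4.0, "NEGATIVE"),
    ("A", 2.0, "NEGATIVE"),
    ("B", 1.0, "NEGATIVE"),
    ("B", 5.0, "POSITIVE"),
    ("B", 5.0, "POSITIVE"),
]
true_share = share_mostly_positive(sample, use_prediction=False)
model_share = share_mostly_positive(sample, use_prediction=True)
print(true_share, model_share)
```

In this sample both products are mostly positive by rating, but the model's labels make product A mostly negative, so the two percentages differ.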

Reminder: Refer to the Hugging Face link above for how to perform the sentiment analysis task.


In [14]:
# Subtask 2.1
import json

classifier = pipeline("sentiment-analysis")

# Load the reviews; the context manager closes the file automatically.
with open('/home/new_gift_cards.json') as f:
    data = json.load(f)
 
# Per-rating tallies of predicted labels, plus the resulting accuracy.
each_star = {
    stars: {'num_pos': 0, 'num_neg': 0, 'accuracy': 0.0}
    for stars in (1.0, 2.0, 3.0, 4.0, 5.0)
}

all_reviews = []
max_each_n_star = 300

for dataItem in data:
    review = dataItem.get("reviewText")
    rating = dataItem.get("overall")

    # Validate the record before using `rating` as a dictionary key.
    if review is None or rating is None:
      print(dataItem)
      print('\n\nerr830')
      break
    if not isinstance(rating, float):
      print('\n\nerr902')
      print(type(rating), rating)
      break
    if len(review) > 500:
      # Skip long reviews (presumably to stay under the model's input limit).
      continue

    # Skip this rating once enough of its reviews have been analyzed.
    # Note: `>` admits up to max_each_n_star + 1 reviews per rating.
    num_analyzed = each_star[rating]['num_pos'] + each_star[rating]['num_neg']
    if num_analyzed > max_each_n_star:
      continue

    ai_result = classifier(review)[0]
    label = ai_result.get('label')
    
    if(label == 'POSITIVE'):
      each_star.get(rating)['num_pos'] += 1
    elif(label == 'NEGATIVE'):
      each_star.get(rating)['num_neg'] += 1
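    else:
      # Fail loudly instead of calling exit(), which would stop the notebook kernel.
      raise ValueError(f'err422: unexpected label {label!r}')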

    all_reviews.append((rating, label))

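# A rating of 3 stars or more counts as positive, so the correct label
# is POSITIVE for 3-5 stars and NEGATIVE for 1-2 stars.
for stars in [1.0, 2.0, 3.0, 4.0, 5.0]:
  desired_label = 'num_pos' if stars >= 3.0 else 'num_neg'
  n_star = each_star.get(stars)
  total = n_star['num_neg'] + n_star['num_pos']
  if total > 0:  # avoid dividing by zero when a rating has no reviews
    n_star['accuracy'] = n_star[desired_label] / total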

print(each_star)
print(all_reviews)




No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


review too long

{1.0: {'num_pos': 0, 'num_neg': 18, 'accuracy': 1.0}, 2.0: {'num_pos': 0, 'num_neg': 11, 'accuracy': 1.0}, 3.0: {'num_pos': 15, 'num_neg': 17, 'accuracy': 0.46875}, 4.0: {'num_pos': 131, 'num_neg': 20, 'accuracy': 0.8675496688741722}, 5.0: {'num_pos': 267, 'num_neg': 34, 'accuracy': 0.8870431893687708}}
[(5.0, 'POSITIVE'), (4.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'NEGATIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (4.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'NEGATIVE'), (5.0, 'NEGATIVE'), (5.0, 'POSITIVE'), (4.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (4.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITIVE'), (5.0, 'POSITI

## Task 3 - Masked Language Modeling

In this task you will be completing a sentence or phrase that has a missing word. I will prepare three datasets for you so you can perform this task. The three datasets will contain missing verbs in one, missing nouns in another, and missing adjectives in the last.

Your task will be to look at the success rate of generating the true missing word in the top 1, top 5, and top 10 generated words for a given sentence. You will then compare the success rate between the three datasets.
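The top-1, top-5, and top-10 success rates can all be computed from one ranked list of candidates per sentence. The sketch below assumes each list is ordered from most to least likely; the function name and sample words are hypothetical. In recent versions of transformers the fill-mask pipeline is expected to accept a `top_k` argument for requesting more candidates, but treat that as an assumption to check against the installed version.

```python
def top_k_success_rate(ranked_guesses, answers, k):
    """Fraction of sentences whose true word is among the first k guesses."""
    if len(ranked_guesses) != len(answers):
        raise ValueError("guess and answer counts differ")
    if not answers:
        return 0.0
    hits = sum(
        1 for guesses, answer in zip(ranked_guesses, answers)
        if answer in guesses[:k]
    )
    return hits / len(answers)


# Made-up ranked candidates for two masked sentences.
ranked = [
    ["ran", "walked", "jumped"],
    ["ate", "cooked", "made"],
]
answers = ["walked", "made"]
for k in (1, 2, 3):
    print(k, top_k_success_rate(ranked, answers, k))
```

Because the same ranked lists are reused for every k, each sentence only needs to be passed through the model once.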

At the moment, these datasets are not prepared. I will update this project and send out an email later so you can download the datasets.

In [22]:

from transformers import pipeline
from pprint import pprint
import csv


unmasker = pipeline("fill-mask")

filepaths = ['/home/masked_nouns.csv', '/home/masked_verbs.csv', '/home/masked_adjs.csv']  
percent_correct = []
for filepath_and_name in filepaths:
  all_sentences = []
  all_answers = []

  with open(filepath_and_name, 'r') as csvfile:
      datareader = csv.reader(csvfile, skipinitialspace=True, quotechar='"')
      # i = 20

      for row in datareader:

          sentence = row[0]
          sentence = sentence.replace('__MASKED__', unmasker.tokenizer.mask_token)
          all_sentences.append(sentence)

          all_answers.append(row[1])
          # i-=1
          # if(i==0):
          #   break


  if len(all_sentences) != len(all_answers):
    raise ValueError('err13798: sentence and answer counts differ')

  # Keep only the top-ranked prediction for each masked sentence.
  guesses = []
  for sentence in all_sentences:
    token_str = unmasker(sentence)[0].get('token_str')
    guesses.append([token_str.strip()])

  if len(all_answers) != len(guesses):
    raise ValueError('err4829: answer and guess counts differ')

  num_correct = 0
  for i in range(len(all_answers)):
    if(all_answers[i] in guesses[i]):
      num_correct += 1
    

  print(all_answers)
  print(guesses)

  # Report a true percentage (the original printed a 0-1 fraction labelled as %).
  percent_correct.append(f'% correct 1st guesses in {filepath_and_name}: {100 * num_correct / len(all_answers):.2f}')

#--------------------------------------------------------------

for i in percent_correct:
  print(i)



No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


