# Laboratory: Tweets Classification - Part II (25%)

**Main Topics:** Text Processing and Text Exploration  

**Deadline:** March 17th (Friday) 11:59 PM (Finnish Time)

**Author:** Andrés Felipe Zapata Palacio  

**Tasks:**  
* #1 (15 Points) Transform the POS Tags
* #2 (15 Points) Lemmatize the tokens
* #3 (15 Points) Define your own Stop Words
* #4 (15 Points) Prepare the training data and the labels
* #5 (40 Points) Train the model, define a good threshold and write tweets to validate the "intelligence" of your model.

🙋 If you have any questions about this assignment, you can submit them through the following form: https://forms.gle/gmQci5fqJSqXrCVk9

## ⚠️ Important Information about the Submission Process ⚠️

As you know, I currently have some restrictions as a teacher at SAMK. Since I don't have an institutional account, I cannot access the Moodle platform in any way. For this reason, I will use Dropbox to receive your exams. You don't need to create a Dropbox account to follow this procedure:

1. Open the following link: https://www.dropbox.com/request/T6GDjfkPitHCOdAXj5GK

2. Click on the button "Add Files", and then "Files from your Computer"

3. Upload your .ipynb file

4. If you are not logged in, the platform will ask for your name and your e-mail. Please enter your full name and your institutional e-mail.


If you have any trouble uploading the files, you MUST contact me. Don't wait until the deadline has passed.

**Important Notes** ⚠️

* You are allowed to modify only the parts of the code that are delimited by comments. These sections usually have a comment that says: "Write your code here".

* For the open questions, you have to explain your opinion or decision in detail. Your answer must demonstrate that you have mastered the topics of the course.

* Upload the notebook in Moodle in ipynb format. You must export the notebook without removing the outputs from the cells.

* Verify that your notebook runs without errors before submitting it in Moodle.

**Name of the Student (Penalization of 5 points)** ⚠️

One requirement for this assignment is to change the name of the file before uploading it to the system, e.g. NLP_04_Lab_Andres_Zapata.ipynb. Additionally, you have to write your name in the space below:

```
Dawid Nalepa
```

If you don't complete these two steps, your final score will be reduced by 5 points.



In [None]:
import nltk
# Tweet Sample Dataset
nltk.download('twitter_samples')

# POS Tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# Lemmatizer
nltk.download('wordnet')

# Stop Words
nltk.download('stopwords')

# Numpy
import numpy as np

# Regular Expressions
import re

# DataFrames
import pandas as pd

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
#@title Auxiliary Functions for Evaluation ⚠️
#@markdown ⚡ Run this cell to load the functions that help you check whether your tasks are done correctly


############    Validate equivalence    ############

def listsHaveSameValues(list1, list2):
    if type(list1) != list or type(list2) != list:
        return False
    if len(list1) != len(list2):
        return False
    for item in list1:
        if item not in list2:
            return False
    return True

def dictionariesHaveSameValues(dict1, dict2):
    if type(dict1) != dict or type(dict2) != dict:
        return False
    if len(dict1) != len(dict2):
        return False
    for key in dict1:
        if key not in dict2:
            return False
        if dict1[key] != dict2[key]:
            return False
    return True

def stringsHaveSameValues(str1, str2):
    return str1 == str2

############    Answer is Correct    ############

def answerIsCorrectList(correctAnswer, input, yourFunction):
  import types
  if not isinstance(yourFunction, types.FunctionType):
    return False
  yourAnswer = yourFunction(input)
  return listsHaveSameValues(correctAnswer, yourAnswer)

def answerIsCorrectDict(correctAnswer, input, yourFunction):
  import types
  if not isinstance(yourFunction, types.FunctionType):
    return False
  yourAnswer = yourFunction(input)
  return dictionariesHaveSameValues(correctAnswer, yourAnswer)

def answerIsCorrectString(correctAnswer, input, yourFunction):
  import types
  if not isinstance(yourFunction, types.FunctionType):
    return False
  yourAnswer = yourFunction(input)
  return stringsHaveSameValues(correctAnswer, yourAnswer)

############    Print Diffs    ############

def printDifferences(correctAnswer, yourAnswer, input):
      print(f'Input:\t\t{input}')
      print(f'Correct Answer:\t{correctAnswer}')
      print(f'Your Answer: \t{yourAnswer}')
      print()


def printDifferencesBetweenDicts(correctDict, yourDict, input=None):
    keysOnlyInCorrect = []
    keysOnlyInYours = []
    keysWithDifferentValues = []

    allKeys = []
    allKeys.extend(list(correctDict))
    allKeys.extend(list(yourDict))
    allKeys = set(allKeys)

    for key in allKeys:
      if (key in correctDict) and (key not in yourDict):
        keysOnlyInCorrect.append(key)
      elif (key in yourDict) and (key not in correctDict):
        keysOnlyInYours.append(key)
      elif correctDict[key] != yourDict[key]:
        keysWithDifferentValues.append(key)
    if input is not None:
      print(f'Input:\n{input}\n')
    print(f'Keys that you are missing:\n{keysOnlyInCorrect}\n')
    print(f'Keys that should not be in your answer:\n{keysOnlyInYours}\n')
    print(f'Keys with wrong values:\n{keysWithDifferentValues}')

############    Test Answer    ############

def testAnswers(yourImplementation, answersAndInputs, answerType):
  if answerType == 'list':
    answerIsCorrect = answerIsCorrectList
    printDiffs = printDifferences
  elif answerType == 'dict':
    answerIsCorrect = answerIsCorrectDict
    printDiffs = printDifferencesBetweenDicts
  elif answerType == 'string':
    answerIsCorrect = answerIsCorrectString
    printDiffs = printDifferences
  else:
    raise Exception(f'Answer Type is not recognized: {answerType}')
  import types
  if not isinstance(yourImplementation, types.FunctionType):
    raise Exception('Your implementation is not a function')
  nTests = len(answersAndInputs)
  for i in range(nTests):
    correctAnswer, input = answersAndInputs[i]
    print(f'Test {i+1}/{nTests} ', end='')
    if answerIsCorrect(correctAnswer, input, yourFunction=yourImplementation):
      print('✅')
      print(f'Input: \t{input}')
      print(f'Answer:\t{correctAnswer}')
      print()
    else:
      yourAnswer = yourImplementation(input)
      print('❌')
      printDiffs(correctAnswer, yourAnswer, input)

############    Print    ############

def showError(message, functionName):
    print(f'Error at Function {functionName}: {message}')

print('The auxiliary functions were loaded successfully')

The auxiliary functions were loaded successfully


## ☑️ Task #1 Transform the POS Tags (15 Points)

Tag Set used by WordNet Lemmatizer:

* n : Noun
* v : Verb
* a : Adjective
* r : Adverb

Universal Tag Set used by NLTK pos_tag() function:

* NOUN : Noun
* VERB : Verb
* ADJ  : Adjective
* ADV  : Adverb
* And many other tags for other categories

**🎯 Task:** Implement a function that receives a string with a Universal POS Tag (NOUN, VERB, ADJ, ADV) and returns the corresponding tag for the WordNet Lemmatizer. If the function receives a tag that is not NOUN, VERB, ADJ or ADV, it must return None.

In [None]:
def transformPosTag(posTag):
  #####   Write your code here   #####
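  # Compare against the exact tag: a prefix check such as startswith('N')
  # would also map the Universal tag 'NUM' (numeral) to a noun.
  if posTag == 'NOUN':
    newPosTag = 'n'
  elif posTag == 'VERB':
    newPosTag = 'v'
  elif posTag == 'ADJ':
    newPosTag = 'a'
  elif posTag == 'ADV':
    newPosTag = 'r'
  else:
    newPosTag = None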
  ####################################
  return newPosTag
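
The mapping described in this task can also be sketched as a dictionary lookup. This is only an alternative sketch, not the required solution; the names `UNIVERSAL_TO_WORDNET` and `transformPosTagLookup` are hypothetical. Because the lookup matches the whole tag, every other Universal tag returns `None`, including tags such as `NUM` that merely share a first letter with `NOUN`.

```python
# Hypothetical lookup table: Universal POS tag -> WordNet POS tag.
UNIVERSAL_TO_WORDNET = {
    'NOUN': 'n',
    'VERB': 'v',
    'ADJ': 'a',
    'ADV': 'r',
}

def transformPosTagLookup(posTag):
    # dict.get returns None when the tag is not one of the four keys.
    return UNIVERSAL_TO_WORDNET.get(posTag)

print(transformPosTagLookup('ADJ'))  # a
print(transformPosTagLookup('NUM'))  # None
```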

In [None]:
#@title Test your implementation of transformPosTag() ⚠️

#@markdown ⚡ Run this cell to validate if you implemented the function correctly


def checkTransformPosTag():
  answersAndInputs = [
    ('a','ADJ'),
    ('v','VERB'),
    ('n','NOUN'),
    ('r','ADV'),
    (None,'aTagThatDoesntExist'),
  ]
  yourImplementation = transformPosTag
  answerType = 'string'
  testAnswers(yourImplementation, answersAndInputs, answerType)

checkTransformPosTag()

Test 1/5 ✅
Input: 	ADJ
Answer:	a

Test 2/5 ✅
Input: 	VERB
Answer:	v

Test 3/5 ✅
Input: 	NOUN
Answer:	n

Test 4/5 ✅
Input: 	ADV
Answer:	r

Test 5/5 ✅
Input: 	aTagThatDoesntExist
Answer:	None



## ☑️ Task #2 Lemmatize the Tokens (15 Points)

Complete the code in the next cell.

**NLTK POS Tagging:** First Part of the task
[Documentation](https://www.nltk.org/api/nltk.tag.html#module-nltk.tag)

```python
from nltk.tag import pos_tag
```

**NLTK Lemmatizer** Second Part of the task
[Documentation](https://www.nltk.org/api/nltk.stem.wordnet.html)

```python
from nltk.stem.wordnet import WordNetLemmatizer
```


**⚠️ Dependencies:** This task depends on **Task 1**

**🎯 Task 2.1:** Use the NLTK POS Tagger, passing it the entire list of tokens that the function receives as a parameter. You must use the Universal tag set.

**🎯 Task 2.2:** Use the NLTK Lemmatizer, passing it each token together with its respective POS tag.

In [None]:
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
def lemmatizeTokens(tokens):
  lemmatizer = WordNetLemmatizer()
  lemmatizedTokens = []
  #############         Write your code here         #############
  nltk.download('omw-1.4')
  tokensWithPosTags = pos_tag(tokens, tagset = 'universal') # (2.1) Use pos_tag() to get the POS tag USING the universal tag set
  for token,posTag in tokensWithPosTags:
    posTag = transformPosTag(posTag)
    if posTag is None:
      lemmatizedTokens.append(token)
    else:
      lemmatizedToken = lemmatizer.lemmatize(token, pos=posTag) # (2.2) Use the lemmatizer to get the lemmatization of the token
      lemmatizedTokens.append(lemmatizedToken)
  ################################################################
  return lemmatizedTokens
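
Because `nltk.download('omw-1.4')` sits inside `lemmatizeTokens`, it runs on every call, which is why the download message is printed once per tokenized tweet. A minimal sketch of a run-once guard follows; `ensureDownloaded` and `_alreadyRequested` are hypothetical names, and a list's `append` stands in for `nltk.download` so the sketch runs without network access.

```python
# Hypothetical helper: remember which packages were already requested.
_alreadyRequested = set()

def ensureDownloaded(packageName, download):
    """Call download(packageName) only the first time a package is requested."""
    if packageName not in _alreadyRequested:
        download(packageName)
        _alreadyRequested.add(packageName)

# Stand-in for nltk.download: record each call instead of downloading.
calls = []
for _ in range(3):
    ensureDownloaded('omw-1.4', calls.append)

print(calls)  # ['omw-1.4']
```

In the notebook, the equivalent call would be `ensureDownloaded('omw-1.4', nltk.download)`; simply moving the download to the setup cell has the same effect.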

In [None]:
tokens = 'two women , two oxen , two oases , two mice'.split()
print('Tokens:',tokens)
print('Lemmatized Tokens:',lemmatizeTokens(tokens))

Tokens: ['two', 'women', ',', 'two', 'oxen', ',', 'two', 'oases', ',', 'two', 'mice']


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Lemmatized Tokens: ['two', 'woman', ',', 'two', 'ox', ',', 'two', 'oasis', ',', 'two', 'mouse']


In [None]:
#@title Test your implementation of lemmatizeTokens() ⚠️

#@markdown ⚡ Run this cell to validate if you implemented the function correctly

def checkLemmatizeTokens():
  answersAndInputs = [
    # Nouns and Verbs
    (['i','be','google','everything'],['i','am','googling','everything']),
    # Verbs
    (['play', 'with', 'that', 'be', 'very', 'dangerous'],['playing','with','that','is','very','dangerous']),
    # Noun
    (['animal', 'and', 'pet', 'be', 'beautiful'],['animals', 'and', 'pets', 'are', 'beautiful']),
    # Noun
    (['thanks', 'for', 'your', 'blessing'],['thanks', 'for', 'your', 'blessings']),
    # Comparatives and Superlatives
    (['you', 'be', 'smart', 'than', 'average,', 'but', 'not', 'the', 'smart'],['you', 'are', 'smarter', 'than', 'average,', 'but', 'not', 'the', 'smartest']),
    # Irregular plurals
    (['two', 'woman', ',', 'two', 'ox', ',', 'two', 'oasis', ',', 'two', 'mouse'],['two', 'women', ',', 'two', 'oxen', ',', 'two', 'oases', ',', 'two', 'mice']),
    # Empty Input should work too
    ([],[])
  ]
  yourImplementation = lemmatizeTokens
  answerType = 'list'
  testAnswers(yourImplementation, answersAndInputs, answerType)

checkLemmatizeTokens()

Test 1/7 ✅
Input: 	['i', 'am', 'googling', 'everything']
Answer:	['i', 'be', 'google', 'everything']

Test 2/7 ✅
Input: 	['playing', 'with', 'that', 'is', 'very', 'dangerous']
Answer:	['play', 'with', 'that', 'be', 'very', 'dangerous']

Test 3/7 ✅
Input: 	['animals', 'and', 'pets', 'are', 'beautiful']
Answer:	['animal', 'and', 'pet', 'be', 'beautiful']

Test 4/7 ✅
Input: 	['thanks', 'for', 'your', 'blessings']
Answer:	['thanks', 'for', 'your', 'blessing']

Test 5/7 ✅
Input: 	['you', 'are', 'smarter', 'than', 'average,', 'but', 'not', 'the', 'smartest']
Answer:	['you', 'be', 'smart', 'than', 'average,', 'but', 'not', 'the', 'smart']

Test 6/7 ✅
Input: 	['two', 'women', ',', 'two', 'oxen', ',', 'two', 'oases', ',', 'two', 'mice']
Answer:	['two', 'woman', ',', 'two', 'ox', ',', 'two', 'oasis', ',', 'two', 'mouse']

Test 7/7 ✅
Input: 	[]
Answer:	[]





## Pre-Task 3: Data Cleaning and Data Processing (0 Points)

We will reuse all the steps performed in the First Assignment: Text Cleaning and Text Processing, including the Bonus Point that splits hashtags composed of multiple words.

For this Pre-task you only have to run the cells. DON'T modify the code.

In [None]:
def preprocessTweet(tweet):
  tweet = re.sub(r'http[s]?://[\S]+', ' ', tweet)              # Remove URLs
  tweet = re.sub(r'[\w]+([._-]\w+)*@\w+([.]\w+)*', ' ', tweet) # Remove e-mails
  tweet = re.sub(r'@\S+','', tweet)                            # Remove mentions
  tweet = re.sub(r'\s+', ' ', tweet)                           # Replace repeated whitespace with a single space
  return tweet
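
To see what the four substitutions do, the sketch below repeats the function (written with raw-string patterns) and applies it to a made-up tweet. The sample text is an assumption for illustration only. Note that the mention pattern `@\S+` also swallows punctuation attached to the mention.

```python
import re

def preprocessTweet(tweet):
    # Same four substitutions as in the lab cell, in the same order.
    tweet = re.sub(r'http[s]?://[\S]+', ' ', tweet)               # URLs
    tweet = re.sub(r'[\w]+([._-]\w+)*@\w+([.]\w+)*', ' ', tweet)  # e-mails
    tweet = re.sub(r'@\S+', '', tweet)                            # mentions
    tweet = re.sub(r'\s+', ' ', tweet)                            # repeated whitespace
    return tweet

# Made-up sample tweet containing a mention, an e-mail address and a URL.
sample = 'Thanks @bob! Mail me at a.b@mail.com or see https://t.co/xyz   now'
print(preprocessTweet(sample))  # Thanks Mail me at or see now
```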

In [None]:
def cleanTokens(tokens):
  newTokens = []
  for token in tokens:
    token = token.lower()
    if re.match(r'^[_*#!$@<=^`>%&\'\"/()\[\]\-+,.:;?]$', token): # Remove tokens that are a single punctuation character
      continue
    if re.match(r'\d+', token): # Remove numbers (any token that starts with a digit)
      continue
    if re.match(r'#[\w\d]+', token): # Strip the leading '#' from hashtags
      token = token[1:]
    newTokens.append(token)
  return newTokens
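
`re.match` only anchors at the start of the token, so the "Remove Numbers" check in `cleanTokens` drops every token that begins with a digit, while a token such as `bi0` (which appears in the vocabulary) survives. The sketch below compares it with `re.fullmatch`, shown only as a stricter alternative and not as a change to the lab code; the sample tokens are made up.

```python
import re

results = {}
for token in ['2015', '2nd', 'bi0']:
    startsWithDigit = re.match(r'\d+', token) is not None   # check used in cleanTokens
    onlyDigits = re.fullmatch(r'\d+', token) is not None    # stricter alternative
    results[token] = (startsWithDigit, onlyDigits)
    print(token, startsWithDigit, onlyDigits)
# 2015 True True
# 2nd True False
# bi0 False False
```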


In [None]:
def splitTokens(tokens):
  splitPattern = r'(?<=[a-z])(?=[A-Z])'
  newTokens = []
  for token in tokens:
    pieces = re.split(splitPattern, token)
    newTokens.extend(pieces)
  return newTokens

**Warning:** The order of the operations really matters
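
A small sketch of why the order matters: `splitTokens` relies on a lowercase-to-uppercase boundary, and `cleanTokens` lowercases every token. If the cleaning ran first, the boundary would be gone and the hashtag could no longer be split. The sample hashtag is made up.

```python
import re

splitPattern = r'(?<=[a-z])(?=[A-Z])'  # same pattern as in splitTokens

token = '#GoodMorning'
splitFirst = re.split(splitPattern, token)          # split, then clean
lowerFirst = re.split(splitPattern, token.lower())  # clean (lowercase), then split

print(splitFirst)  # ['#Good', 'Morning']
print(lowerFirst)  # ['#goodmorning']
```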

In [None]:
from nltk.tokenize import TweetTokenizer

def tokenizeTweet(tweet):
  tokens = TweetTokenizer().tokenize(tweet)
  splittedTokens = splitTokens(tokens)
  cleanedTokens = cleanTokens(splittedTokens)
  lemmatizedTokens = lemmatizeTokens(cleanedTokens)
  return lemmatizedTokens

In [None]:
from nltk.corpus import stopwords

englishStopWords = stopwords.words('english')

In [None]:
from nltk.corpus import twitter_samples

sampleSet = twitter_samples.strings('positive_tweets.json')

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer(
    preprocessor = preprocessTweet,
    stop_words = englishStopWords,
    tokenizer = tokenizeTweet,
    max_features = 900,
  )

In [None]:
allBagsOfWord = counter.fit_transform(sampleSet)
vocab = list(counter.get_feature_names_out())

print('\nIMPORTANT !!!!\n\n')
print('This is the vocabulary that you will need to read in order to finish Task 3 succesfully:\n')
print(vocab)


IMPORTANT !!!!


This is the vocabulary that you will need to read in order to finish Task 3 succesfully:

['):', '..', '...', ':)', ':-)', ':d', ':p', ';)', ';-)', '<3', '\\', 'able', 'absolutely', 'account', 'act', 'active', 'actually', 'add', 'address', 'advice', 'af', 'afternoon', 'ago', 'agree', 'ah', 'ahh', "ain't", 'air', 'al', 'album', 'alien', 'allah', 'almost', 'along', 'already', 'alright', 'also', 'always', 'amaze', 'amazing', 'among', 'android', 'anniversary', 'another', 'answer', 'anyone', 'anything', 'anytime', 'anyway', 'apology', 'app', 'apparently', 'apply', 'appreciate', 'aqui', 'around', 'arrive', 'art', 'article', 'artist', 'asap', 'asian', 'ask', 'asleep', 'august', 'australia', 'available', 'awake', 'away', 'awesome', 'aww', 'awww', 'awwww', 'b', 'babe', 'baby', 'back', 'bad', 'bae', 'bajrangi', 'ball', 'bam', 'bath', 'bc', 'bday', 'beat', 'beautiful', 'beauty', 'become', 'bed', 'believe', 'best', 'bestfriend', 'bet', 'bhaijaan', 'bi0', 'big', 'birthday', 'bit',

## Task 3: Choose your own Stop Words (15 points)

**🎯Task 3.1:** Look at the output of the previous cell and choose at least 10 tokens from it that you think should be treated as **stop words** for this specific solution. Choose words that are completely useless for determining whether a tweet is positive or negative.

**🎯Task 3.2:** You must also choose at least 10 tokens that you think should be cleaned. These tokens are not stop words, but tokens that don't carry any relevant meaning. Make sure that these tokens are not emojis before writing them here.

In [None]:
# Task 3.1

myStopWords = [
  ##########     Write your code here     ##########
  'also' , 'else', 'able', 'account', 'advice', 'ago', 'almost', 'already', 'anyone', 'anything'
  ##################################################
]

In [None]:
# Task 3.2

myMeaninglessTokens = [
  ##########     Write your code here     ##########
  '\\', '..', '...', '—', '-', '’', '"', 'yo', 'hmm', 'aha', 'umm', 'tgif', 'e', 'r'
  ##################################################
]

⁉️ **Question** 🧐

Why did you decide to add those words?
```
I chose those stop words because they
are generic and do not contribute much
to determining whether a tweet is positive or negative.

As for the tokens that I believe should be
cleaned, I chose them because they are informal English.
They are mostly used on social media, may contain
spelling mistakes, and are too generic.
```



## Task 4: Prepare Data (15 Points)

Data preparation before training consists of 7 steps, each one identified by a letter from A to G. There is a cell further below where you must explain the code of steps F and G.

**🎯Task:** Explain the code in your own words. Be clear and precise. Your answer doesn't need to be long, but you must demonstrate that you understand the code.

**⚠️ Dependencies:** This task depends on **Task 1, Task 2 and Task 3**. If you couldn't finish one of them successfully, you can still answer this Task just by reading the code. If you want to run it without having completed Task 1, 2 or 3, you can reset the code of those cells to the original code, but if those cells fail, this one will fail too.

#### (A) Merge English Stop words and the words defined by the student in a single list

In [None]:
allStopWords = []
allStopWords.extend(englishStopWords)
allStopWords.extend(myStopWords)
allStopWords.extend(myMeaninglessTokens)

#### (B) Load the Sample Dataset using NLTK

In [None]:
from nltk.corpus import twitter_samples

positiveTweets = twitter_samples.strings('positive_tweets.json')
negativeTweets = twitter_samples.strings('negative_tweets.json')

#### (C) Merge Positive and Negative Tweets in a single list

In [None]:
listAllTweets = []
listAllTweets.extend(positiveTweets)
listAllTweets.extend(negativeTweets)

#### (D) Build the Count Vectorizer, defining the Text Cleaning and Text Processing steps

As you can see in the parameter max_features, we keep only the 840 most frequent tokens of the corpus. This limits the vocabulary to the words that appear most often. Reducing the dimensions of the input data helps prevent the model from overfitting to specific words.
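
In scikit-learn, `max_features` keeps the tokens with the highest frequency across the whole corpus. The pure-Python sketch below illustrates that idea on a toy corpus; it is not scikit-learn's implementation (tie-breaking may differ), and `topKVocabulary` is a hypothetical name.

```python
from collections import Counter

def topKVocabulary(tokenizedTweets, maxFeatures):
    # Count every token across the whole corpus.
    counts = Counter(token for tweet in tokenizedTweets for token in tweet)
    # Keep the maxFeatures most frequent tokens, then sort them alphabetically.
    mostFrequent = [token for token, _ in counts.most_common(maxFeatures)]
    return sorted(mostFrequent)

# Toy corpus: 'good' appears 3 times, 'day' twice, everything else once.
tweets = [['good', 'day'], ['good', 'morning'], ['bad', 'day'], ['good', 'night']]
print(topKVocabulary(tweets, 2))  # ['day', 'good']
```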

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer(
    preprocessor = preprocessTweet,
    stop_words = allStopWords,
    tokenizer = tokenizeTweet,
    max_features = 840,
  )

#### (E) Transform the List of Tweets into a Group of Bags of Words

These bags of words are stored in a single sparse matrix, where each column represents a word in the vocabulary and each row represents one tweet.
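
The row and column layout can be sketched on a toy vocabulary. This illustration uses plain nested lists instead of a sparse matrix, and `toBagOfWords` is a hypothetical helper, not part of scikit-learn.

```python
vocabulary = ['bad', 'day', 'good']  # toy vocabulary: one column per word

def toBagOfWords(tokens, vocabulary):
    # One count per vocabulary word; tokens outside the vocabulary are ignored.
    return [tokens.count(word) for word in vocabulary]

# Toy tokenized tweets: one row per tweet.
tweets = [['good', 'good', 'day'], ['bad', 'day'], ['sunset', 'tonight']]
matrix = [toBagOfWords(tweet, vocabulary) for tweet in tweets]

for row in matrix:
    print(row)
# [0, 1, 2]
# [1, 1, 0]
# [0, 0, 0]
```

The last row is all zeros because none of its tokens are in the vocabulary, so such a tweet gives the model nothing to work with.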


In [None]:
allTweets = counter.fit_transform(listAllTweets)

#### (F) Task 4.1 ⚠️

**🎯Task:** Explain the code in your own words. What are the labels? How can Numpy help us to generate the labels? Explain each line. Be precise: not very long answers, but explain everything in a clear way.

**🧩Hint:** You can go through the official documentation.

**⚠️Warning:** Write your explanation in the cell further below. Its title is ⁉️ Question (20 Points) 🧐

**Numpy Ones** [Documentation](https://numpy.org/doc/stable/reference/generated/numpy.ones.html)  
**Numpy Zeros** [Documentation](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html)  
**Numpy Horizontal Stack** [Documentation](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html)  


In [None]:
######          Task 4: Part I         ######

sizePositive = len(positiveTweets)
sizeNegative = len(negativeTweets)

positiveLabels = np.ones(sizePositive)
negativeLabels = np.zeros(sizeNegative)
allLabels = np.hstack((positiveLabels,negativeLabels))

######     Don't modify, only read     ######

#### (G) Task 4.2 ⚠️

**🎯Task:** Explain the code in your own words. Explain the different parameters and outputs of the function. What does the function do? What does it return?

**🧩Hint:** You can go through the official documentation.


**⚠️Warning:** Write your explanation in the cell further below. Its title is ⁉️ Question (20 Points) 🧐.

**SKLearn Train/Test Split** [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)  

In [None]:
######          Task 4: Part II         ######

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(allTweets, allLabels, shuffle=True)

######     Don't modify, only read     ######

In [None]:
#@title 💡 Hint for Task 4.2
#@markdown ⚡ Run this cell to see the shape of X_train, X_test, y_train and y_test


print('Shape X_train:\t', X_train.shape)
print('Shape y_train:\t', y_train.shape)
print('Shape X_test:\t', X_test.shape)
print('Shape y_test:\t', y_test.shape)

Shape X_train:	 (7500, 840)
Shape y_train:	 (7500,)
Shape X_test:	 (2500, 840)
Shape y_test:	 (2500,)


⁉️ **Question (20 Points)** 🧐

Part 1 ⚠️
```
The code generates the labels for the
positive and negative tweets. The labels are the
target values the model learns to predict:
1 for a positive tweet and 0 for a negative tweet.
NumPy is used to create the arrays of labels.

The first two lines determine the
number of positive and negative tweets.
"sizePositive" and "sizeNegative" store the
lengths of the positive and negative tweet lists.

"np.ones" creates an array of
size "sizePositive" filled exclusively with ones.
This array holds the labels for the positive tweets.
Similarly, "np.zeros" creates an array of
size "sizeNegative" filled exclusively with zeros.
This array holds the labels for the negative tweets.

In the last line, "np.hstack" concatenates
the positive and negative label arrays into a single array.
The resulting array contains the labels for all tweets,
positives first, in the same order as "listAllTweets".

```
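
The label construction from Part I can be checked on a toy example. The sizes 3 and 2 are made up; the NumPy calls are the same ones used in the lab cell.

```python
import numpy as np

positiveLabels = np.ones(3)    # three positive tweets -> [1. 1. 1.]
negativeLabels = np.zeros(2)   # two negative tweets  -> [0. 0.]
allLabels = np.hstack((positiveLabels, negativeLabels))

print(allLabels)        # [1. 1. 1. 0. 0.]
print(allLabels.shape)  # (5,)
```

The labels line up with the tweets only because step (C) also appends the positive tweets first and the negative tweets second.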

Part 2 ⚠️
```
The first line imports the function
'train_test_split' from the 'model_selection' module
of the 'sklearn' library.

'train_test_split' takes two arrays, 'allTweets' and
'allLabels', which contain the data for training a machine
learning model. The 'shuffle' parameter is set to 'True',
meaning that the function randomly shuffles the data
before splitting it into training and testing sets.
Since no 'test_size' is given, 25% of the rows go to the
test set (2500 of the 10000 tweets here).

The function returns four arrays. 'X_train' and 'y_train'
contain the data that will be used to train the model.
'X_test' and 'y_test' contain the data that will be used
to evaluate the performance of the model.
```
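
What "shuffle, then split" means can be sketched in plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: `shuffleAndSplit` is a hypothetical helper, and its default `testFraction=0.25` mirrors the default that `train_test_split` applies when neither `test_size` nor `train_size` is given, which matches the 7500/2500 shapes printed above.

```python
import math
import random

def shuffleAndSplit(samples, labels, testFraction=0.25, seed=0):
    # Shuffle the row indices once so samples and labels stay paired.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    nTest = math.ceil(testFraction * len(samples))
    testIdx, trainIdx = indices[:nTest], indices[nTest:]
    X_train = [samples[i] for i in trainIdx]
    X_test = [samples[i] for i in testIdx]
    y_train = [labels[i] for i in trainIdx]
    y_test = [labels[i] for i in testIdx]
    return X_train, X_test, y_train, y_test

# Toy data: 8 "tweets", the first four positive and the last four negative.
samples = ['t0', 't1', 't2', 't3', 't4', 't5', 't6', 't7']
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X_train, X_test, y_train, y_test = shuffleAndSplit(samples, labels)
print(len(X_train), len(X_test))  # 6 2
```

Shuffling matters here because the tweets are stored with all positives first; without it, a split by position would put a single class in the test set.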


## Task 5: Train Model and Define good threshold (40 Points)

**🎯Task 5.1:** Try different values for the threshold and keep the one that you consider best for the model. What can you observe? What could be the reason for this?

**🎯Task 5.2:** Evaluate the model by writing your own tweets.

**⚠️ Dependencies:** This task depends on **Task 1, Task 2 and Task 3**. If you couldn't finish one of them successfully, you can still answer this Task just by reading the code. If you want to run it without having completed Task 1, 2 or 3, you can reset the code of those cells to the original code, but if those cells fail, this one will fail too.

In [None]:
from sklearn.linear_model import LogisticRegressionCV

In [None]:
model = LogisticRegressionCV(max_iter=2000)
model.fit(X_train,y_train)
print('The training of the Logistic Regression Model has finished :)')

The training of the Logistic Regression Model has finished :)


In [None]:
#@markdown Run this cell to get the accuracy on both datasets (Train and Test) using the threshold defined by you. Remember that the threshold goes from 0 to 1. Threshold=0.5 is the default threshold; however, we can't assume that this is the best threshold.

threshold = 0.5 #@param {type:"number"}

def getPrediction(probabilities, threshold):
  predictions = []
  for prob in probabilities:
    if prob >= threshold:
      pred = 1
    else:
      pred = 0
    predictions.append(pred)
  return predictions

testProbabilities = model.predict_proba(X_test)[:,1]
testPredictions = getPrediction(testProbabilities, threshold)

trainProbabilities = model.predict_proba(X_train)[:,1]
trainPredictions = getPrediction(trainProbabilities, threshold)

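# Import the metric explicitly: a bare `import sklearn` does not guarantee
# that the sklearn.metrics submodule is loaded.
from sklearn.metrics import accuracy_score

trainAccuracy = accuracy_score(y_train, trainPredictions)
testAccuracy = accuracy_score(y_test, testPredictions)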

trainAccuracy *= 100
testAccuracy *= 100

print(f'Train Accuracy: {trainAccuracy:.2f}%')
print(f'Test Accuracy: {testAccuracy:.2f}%')
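
The loop in `getPrediction` can also be written as a single NumPy comparison. The sketch below is an equivalent alternative, not a required change; `getPredictionVectorized` is a hypothetical name and the probabilities are made up.

```python
import numpy as np

def getPredictionVectorized(probabilities, threshold):
    # Element-wise comparison gives booleans; astype(int) turns them into 0/1.
    return (np.asarray(probabilities) >= threshold).astype(int)

predictions = getPredictionVectorized([0.2, 0.5, 0.9], 0.5)
print(predictions)  # [0 1 1]
```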

Train Accuracy: 99.96%
Test Accuracy: 99.64%


⁉️ **Question (Task 5.1)** 🧐

What happens when you try different values? Is this good or bad? Any explanation for this?

**🧩Hint:** Try with values very close to zero ( 0.001 , 0.0001 ), values in the middle ( 0.2 , 0.5 , 0.7 ), and values very close to one ( 0.999 , 0.9999)

```
The accuracy of the model can change
in several ways when we experiment with different
threshold levels. Altering the threshold changes the
model's decision rule for classifying data points
as positive or negative.

If we set the threshold too high, the model becomes
overly strict and classifies fewer data points as positive,
which lowers recall and reduces accuracy. In contrast,
if we set the threshold too low, the model classifies too many
data points as positive, which raises recall but lowers
precision (more false positives) and also tends to reduce accuracy.
```

**Optional Task for Extra Points:** You can also analyze the precision and recall (a.k.a. sensitivity) with different thresholds. The number of extra points will depend on your analysis.
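
A minimal sketch of such an analysis, using made-up labels and probabilities rather than the lab's model output. `predictWithThreshold` and `precisionAndRecall` are hypothetical helpers; reporting 0.0 when a denominator is zero is a convention assumed here.

```python
def predictWithThreshold(probabilities, threshold):
    # Same rule as getPrediction: positive when the probability reaches the threshold.
    return [1 if prob >= threshold else 0 for prob in probabilities]

def precisionAndRecall(yTrue, yPred):
    tp = sum(1 for t, p in zip(yTrue, yPred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(yTrue, yPred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(yTrue, yPred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up labels and probabilities, for illustration only.
yTrue = [1, 1, 1, 0, 0, 0]
probabilities = [0.95, 0.80, 0.40, 0.60, 0.20, 0.05]

for threshold in (0.1, 0.5, 0.9):
    yPred = predictWithThreshold(probabilities, threshold)
    precision, recall = precisionAndRecall(yTrue, yPred)
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}')
# threshold=0.1: precision=0.60, recall=1.00
# threshold=0.5: precision=0.67, recall=0.67
# threshold=0.9: precision=1.00, recall=0.33
```

In this toy data, lowering the threshold raises recall at the cost of precision, and raising it does the opposite. In the notebook, the same loop could be run with `y_test` and `testProbabilities`.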

In [None]:
from sklearn.metrics import classification_report

print(f'\t((  Classification Report for Test Set  ))\n')

y_test_pred = model.predict(X_test)
print(classification_report(y_true=y_test , y_pred=y_test_pred))

	((  Classification Report for Test Set  ))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      1265
         1.0       1.00      1.00      1.00      1235

    accuracy                           1.00      2500
   macro avg       1.00      1.00      1.00      2500
weighted avg       1.00      1.00      1.00      2500



### Evaluate your own Tweets (Task 5.2) ⚠️

For this task you must build 4 different tweets using words from the vocabulary. You can see the words in the vocabulary if you run the following cell.

**⚠️ Warning:** Remember to RESET the threshold to 0.5 and run the cells again.

In [None]:
print(vocab)

['):', '..', '...', ':)', ':-)', ':d', ':p', ';)', ';-)', '<3', '\\', 'able', 'absolutely', 'account', 'act', 'active', 'actually', 'add', 'address', 'advice', 'af', 'afternoon', 'ago', 'agree', 'ah', 'ahh', "ain't", 'air', 'al', 'album', 'alien', 'allah', 'almost', 'along', 'already', 'alright', 'also', 'always', 'amaze', 'amazing', 'among', 'android', 'anniversary', 'another', 'answer', 'anyone', 'anything', 'anytime', 'anyway', 'apology', 'app', 'apparently', 'apply', 'appreciate', 'aqui', 'around', 'arrive', 'art', 'article', 'artist', 'asap', 'asian', 'ask', 'asleep', 'august', 'australia', 'available', 'awake', 'away', 'awesome', 'aww', 'awww', 'awwww', 'b', 'babe', 'baby', 'back', 'bad', 'bae', 'bajrangi', 'ball', 'bam', 'bath', 'bc', 'bday', 'beat', 'beautiful', 'beauty', 'become', 'bed', 'believe', 'best', 'bestfriend', 'bet', 'bhaijaan', 'bi0', 'big', 'birthday', 'bit', 'bless', 'blog', 'blue', 'body', 'book', 'bore', 'bot', 'boy', 'brain', 'brand', 'break', 'brilliant', 'bri

In [None]:
def predictTweet(tweet):
  tweets = [tweet]
  tweets = counter.transform(tweets)
  return model.predict(tweets)

def seePrediction(tweet):
  if predictTweet(tweet)[0] == 1:
    print('Your tweet was classified as Positive')
  else:
    print('Your tweet was classified as Negative')

In [None]:
###############      Write your code here      ###############
positiveTweet1 = '''
        The sunset was so beautiful tonight <3
'''
##############################################################

seePrediction(positiveTweet1)

Your tweet was classified as Positive




In [None]:
###############      Write your code here      ###############
positiveTweet2 = '''
        Thanks to the unfollowers I'm gonna review life #worth
'''
##############################################################

seePrediction(positiveTweet2)

Your tweet was classified as Negative




In [None]:
###############      Write your code here      ###############
negativeTweet1 = '''
        This rainy weather is really starting to get me down..
'''
##############################################################

seePrediction(negativeTweet1)

Your tweet was classified as Negative




In [None]:
###############      Write your code here      ###############
negativeTweet2 = '''
        I wish twitter cared the same way as google does 👉👈
'''
##############################################################

seePrediction(negativeTweet2)

Your tweet was classified as Positive


